<div align="center"> <img src="img/logo.png"><br> <h3>Protein Structures Voxelisation for Deep Learning</h3><br> </div>

aposteriori is a library for the voxelization of protein structures for protein design. It uses conventional PDB files to create fixed, discretized regions of space called "frames", one around each residue. The side-chain atoms of the residues are removed so that a deep learning classifier must determine the identity of the residue in each frame based solely on the protein backbone structure.

<div align="center"> <img src="img/voxelisation.png"><br> </div>
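As a rough illustration of what discretizing a frame means, the sketch below maps a position inside a frame to voxel indices, using the default values quoted in the help text further down (a 12 Å edge and 21 voxels per side). The function names and the indexing convention (frame centred on the origin, index 0 at the most negative corner) are assumptions made for this example and are not taken from aposteriori's internals.

```python
from typing import Optional, Sequence, Tuple


def voxel_size(frame_edge_length: float = 12.0, voxels_per_side: int = 21) -> float:
    """Edge length of a single voxel, in Angstroms."""
    return frame_edge_length / voxels_per_side


def coord_to_voxel_index(
    coord: Sequence[float],
    frame_edge_length: float = 12.0,
    voxels_per_side: int = 21,
) -> Optional[Tuple[int, ...]]:
    """Map an (x, y, z) position, given relative to the frame centre,
    to integer voxel indices. Returns None if the point is outside the frame.
    """
    size = voxel_size(frame_edge_length, voxels_per_side)
    half_edge = frame_edge_length / 2
    indices = []
    for value in coord:
        # Shift so the frame spans [0, frame_edge_length) along this axis.
        shifted = value + half_edge
        if not 0 <= shifted < frame_edge_length:
            return None
        # Clamp to guard against floating point rounding at the upper edge.
        indices.append(min(int(shifted // size), voxels_per_side - 1))
    return tuple(indices)
```

With the defaults, each voxel is 12 / 21 ≈ 0.57 Å wide, and a point at the frame centre falls in the middle voxel of each axis (index 10 of 0..20).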

Installation

PyPI

pip install aposteriori

Manual Install

Clone the repository and change directory to the aposteriori folder if you have not done so already:

git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/

Install aposteriori

pip install .

Creating a Dataset

There are two ways to create a dataset using aposteriori: through the Python API in aposteriori.make_frame_dataset, or using the command line tool make-frame-dataset that is installed alongside the module:

make-frame-dataset /path/to/folder

If you want to try out an example, run:

make-frame-dataset tests/testing_files/pdb_files/

Check the make-frame-dataset help page for more details on its usage:

Usage: make-frame-dataset [OPTIONS] STRUCTURE_FILE_FOLDER

  Creates a dataset of voxelized amino acid frames.

  A frame refers to a region of space around an amino acid. For every
  residue in the input structure(s), a cube of space around the region (with
  an edge length equal to `--frame-edge-length`, default 12 Å), will be
  mapped to discrete space, with a defined number of voxels per edge (equal
  to `--voxels-per-side`, default = 21).

  Basic Usage:

  `make-frame-dataset $path_to_folder_with_pdb/`

  eg. `make-frame-dataset tests/testing_files/pdb_files/`

  This command will make a tiny dataset in the current directory
  `test_dataset.hdf5`, containing all residues of the structures in the
  folder.

  Globs can be used to define the structure files to be processed.
  `make-frame-dataset pdb_files/**/*.pdb` would include all `.pdb` files in
  all subdirectories of the `pdb_files` directory.

  You can process gzipped pdb files, but the program assumes that the format
  of the file name is similar to `1mkk.pdb.gz`. If you have more complex
  requirements than this, we recommend using this library directly from
  Python rather than through this CLI.

  The hdf5 object itself is like a Python dict. The structure is simple:
  
    └─[pdb_code] Contains a number of subgroups, one for each chain.
      └─[chain_id] Contains a number of subgroups, one for each residue.
        └─[residue_id] voxels_per_side^3 array of ints, representing element number.
          └─.attrs['label'] Three-letter code for the residue.
          └─.attrs['encoded_residue'] One-hot encoding of the residue.
    └─.attrs['make_frame_dataset_ver']: str - Version used to produce the dataset.
    └─.attrs['frame_dims']: t.Tuple[int, int, int, int] - Dimensions of the frame.
    └─.attrs['atom_encoder']: t.List[str] - Labels used for the encoding (eg, ["C", "N", "O"]).
    └─.attrs['encode_cb']: bool - Whether a Cb atom was added at the avg position of (-0.741287356, -0.53937931, -1.224287356).
    └─.attrs['atom_filter_fn']: str - Function used to filter the atoms in the frame.
    └─.attrs['residue_encoder']: t.List[str] - Ordered list of residues corresponding to the encoding used.
    └─.attrs['frame_edge_length']: float - Length of the frame in Angstroms (A)
    └─.attrs['voxels_as_gaussian']: bool - Whether the voxels are encoded as a floating point of a gaussian (True) or boolean (False)

  So hdf5['1ctf']['A']['58'] would be an array for the voxelized frame of residue 58 in chain A of 1ctf.

Options:
  -o, --output-folder PATH        Path to folder where output will be written.
                                  Default = `.`

  -n, --name TEXT                 Name used for the dataset file, the `.hdf5`
                                  extension does not need to be included as it
                                  will be appended. Default = `frame_dataset`

  -e, --extension TEXT            Extension of structure files to be included.
                                  Default = `.pdb`.

  --pieces-filter-file PATH       Path to a PISCES format file used to filter
                                  the dataset to specific chains in specific
                                  files. All other PDB files included in the
                                  input will be ignored.

  --frame-edge-length FLOAT       Edge length of the cube of space around each
                                  residue that will be voxelized. Default =
                                  12.0 Angstroms.

  --voxels-per-side INTEGER       The number of voxels per side of the frame.
                                  This will give a final cube of
                                  `voxels-per-side`^3. Default = 21.

  -p, --processes INTEGER         Number of processes to be used to create the
                                  dataset. Default = 1.

  -z, --is_pdb_gzipped            If True, this flag indicates that the
                                  structure files are gzipped. Default =
                                  False.

  -r, --recursive                 If True, all files in all subfolders will be
                                  processed.

  -v, --verbose                   Sets the verbosity of the output, use `-v`
                                  for low level output or `-vv` for even more
                                  information.

  -cb, --encode_cb BOOLEAN        Encode the Cb at an average position
                                  (-0.741287356, -0.53937931, -1.224287356) in
                                  the aligned frame, even for Glycine
                                  residues. Default = True

  -ae, --atom_encoder [CNO|CNOCB|CNOCBCA]
                                  Encodes atoms in different channels,
                                  depending on atom types. Default is CNO,
                                  other options are `CNOCB` and `CNOCBCA` to
                                  encode the Cb or Cb and Ca in different
                                  channels respectively.  [required]

  -d, --download_file PATH        Path to csv file with PDB codes to be
                                  voxelised. The biological assembly will be
                                  used for download. PDB codes will be
                                  downloaded to the /pdb/ folder.

  -g, --voxels_as_gaussian BOOLEAN
                                  Boolean - whether to encode voxels as
                                  gaussians (True) or voxels (False). The
                                  gaussian representation uses the Van der
                                  Waals radius of each atom using the formula
                                  e^(-x^2) where x is ((Vx - x)^2 + (Vy - y)^2
                                  + (Vz - z)^2) / r^2 and (Vx, Vy, Vz) is the
                                  position of the voxel in space, (x, y, z) is
                                  the position of the atom in space, and r is
                                  the Van der Waals radius of the atom. They
                                  are then normalized to add up to 1.

  -b, --blacklist_csv PATH        Path to csv file with structures to be
                                  removed.

  -comp, --compression_gzip BOOLEAN
                                  Whether to compress the dataset with gzip
                                  compression.

  -vas, --voxelise_all_states BOOLEAN
                                  Whether to voxelise only the first state of
                                  the NMR structure (False) or all of them
                                  (True).

  -rot, --tag_rotamers BOOLEAN    Whether to tag rotamer information to the
                                  frame (True) or not (False).

  --help                          Show this message and exit.
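The layout described in the help text (pdb_code, then chain_id, then residue_id, with metadata stored in `.attrs`) can be walked with ordinary dict-style iteration. The following is a minimal sketch that assumes the file follows exactly that layout and that the third-party `h5py` package is installed. The helper names `iter_frames`, `decode_residue` and `summarise_dataset` are made up for this example and are not part of aposteriori.

```python
from typing import Any, Iterator, Mapping, Sequence, Tuple


def iter_frames(dataset: Mapping[str, Any]) -> Iterator[Tuple[str, str, str, Any]]:
    """Yield (pdb_code, chain_id, residue_id, frame) for every frame.

    Works on any nested dict-like object with the documented
    pdb_code -> chain_id -> residue_id layout, such as an open h5py.File.
    """
    for pdb_code, chains in dataset.items():
        for chain_id, residues in chains.items():
            for residue_id, frame in residues.items():
                yield pdb_code, chain_id, residue_id, frame


def decode_residue(encoded_residue: Sequence[float], residue_encoder: Sequence[str]) -> str:
    """Return the residue name at the position of the 1 in a one-hot vector."""
    values = list(encoded_residue)
    return residue_encoder[values.index(max(values))]


def summarise_dataset(path: str) -> None:
    """Print some dataset-level attributes and one line per frame."""
    import h5py  # third-party; imported here so the helpers above stay dependency-free

    with h5py.File(path, "r") as dataset:
        for key in ("frame_dims", "atom_encoder", "frame_edge_length"):
            print(key, dataset.attrs[key])
        residue_encoder = list(dataset.attrs["residue_encoder"])
        for pdb_code, chain_id, residue_id, frame in iter_frames(dataset):
            voxels = frame[()]  # read the full voxel array into memory
            residue = decode_residue(frame.attrs["encoded_residue"], residue_encoder)
            print(pdb_code, chain_id, residue_id, frame.attrs["label"], residue, voxels.shape)
```

For a dataset created with the default name, this could be called as `summarise_dataset("frame_dataset.hdf5")`.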

Example 1: Create a Dataset Using Biological Units of Proteins

Ideally, if you are trying to solve the Inverse Protein Folding Problem, you should use Biological Units, as they are the minimal functional part of a protein. This prevents having solvent-exposed hydrophobic residues as training data.

Download the dataset:

To read more about biological units: https://pdbj.org/help/about-aubu and https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies

Once the dataset is downloaded, you will have a directory with sub-directories containing the gzipped PDB structures (i.e. your Protein Data Bank files).

To voxelize the structures into frames, run:

make-frame-dataset /path/to/biounits/ -e .pdb1.gz

If everything went well, you should see the number of structures that will be voxelised and a list of default parameters. Press "y" to proceed.

Example 2: Create a Dataset Using Biological Units of Proteins and PISCES

PISCES (Protein Sequence Culling Server) is a curated subset of protein structures. Each file contains a list of structures with parameters such as resolution, percentage identity and R-Values.

Aposteriori supports filtering with a PISCES file as follows:

make-frame-dataset /path/to/biounits/ -e .pdb1.gz --pieces-filter-file path/to/pisces/cullpdb_pc90_res1.6_R0.25_d190114_chains8082

If everything went well, you should see the number of structures that will be voxelised and a list of default parameters. Press "y" to proceed.
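When several datasets need to be built, it can be convenient to assemble the same command from a Python script. The sketch below only uses flags listed in the help text (`-e`, `--pieces-filter-file`, `-o`, `-n`). `build_command` and `run_command` are hypothetical helper names, not part of aposteriori, and the confirmation prompt is assumed to appear in the terminal as usual.

```python
import subprocess
from typing import List, Optional


def build_command(
    structure_folder: str,
    extension: str = ".pdb1.gz",
    pisces_file: Optional[str] = None,
    output_folder: Optional[str] = None,
    name: Optional[str] = None,
) -> List[str]:
    """Assemble a make-frame-dataset call as an argument list."""
    cmd = ["make-frame-dataset", structure_folder, "-e", extension]
    if pisces_file is not None:
        cmd += ["--pieces-filter-file", pisces_file]
    if output_folder is not None:
        cmd += ["-o", output_folder]
    if name is not None:
        cmd += ["-n", name]
    return cmd


def run_command(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command and raise CalledProcessError on a non-zero exit code.

    stdin is inherited from the parent process, so the "y" confirmation
    still has to be typed in the terminal.
    """
    return subprocess.run(cmd, check=True)
```

Passing an argument list instead of a single shell string avoids quoting problems with paths that contain spaces.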

Development

The easiest way to install a development version of aposteriori is using Conda:

Conda

Create the environment:

conda create -n aposteriori python=3.8

Activate it and clone the repository:

conda activate aposteriori
git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/

Install dependencies:

pip install -r dev-requirements.txt

Install aposteriori:

pip install .

Check that aposteriori works:

make-frame-dataset --help

Make sure you test your install:

pytest tests/

Pip (only)

Alternatively you can install the repository with pip:

git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/
pip install -r dev-requirements.txt

Install aposteriori:

pip install .