Home

Awesome

LEADD

About

LEADD (Lamarckian Evolutionary Algorithm for de novo Drug Design) is a tool for molecular design and optimization.

Molecules are represented as meta-graphs of molecular fragments. Fragments are extracted by fragmenting an input virtual library, and broken bonds are converted to labelled attachment points (a.k.a. connectors). Molecules are reassembled by genetic operators that combine fragments through said connectors. Knowledge-based connectivity rules, extracted from the same input library, are enforced throughout the process. A population of molecules is evolved stochastically through use of genetic operators, with the goal of optimizing a user-provided scoring function. A Lamarckian evolutionary mechanism adjusts the reproductive behaviour of the molecules based on the outcome of the previous generation.

A more detailed description of the algorithm can be found in the corresponding paper: Kerstjens, A., De Winter, H. LEADD: Lamarckian evolutionary algorithm for de novo drug design. J Cheminform 14, 3 (2022). https://doi.org/10.1186/s13321-022-00582-y.

Authors

Dependencies

It's recommended to also build the Python bindings. On top of the above, you will need the following:

If you are interested in re-running the benchmarks you will also need:

The versions listed above have been tested to work on both Ubuntu 20.04 and Windows 10. Older versions may also work, with the exception of the RDKit due to a bug in older versions.

Installation

The following sections assume you are installing LEADD on a GNU/Linux machine. If this isn't the case you may have to adapt these instructions slightly.

Clone this repository (along with the pybind11 submodule). The resulting directory will be referred to as ${LEADD}.

git clone --recurse-submodules https://github.com/UAMCAntwerpen/LEADD.git

Install the rest of the dependencies. If you don't mind a bloated installation you can install them into an Anaconda environment using the provided leadd_conda_env.yml file.

cd ${LEADD}
conda env create -f leadd_conda_env.yml
conda activate LEADD

If you installed the RDKit through Anaconda you can build LEADD as follows:

mkdir build && cd build
cmake -DRDKit_INCLUDE_DIRS=${CONDA_PREFIX}/include/rdkit -DRDKit_LIBRARY_DIRS=${CONDA_PREFIX}/lib ..
make install clean

Alternatively, if you installed the RDKit from source you should have an environment variable ${RDBASE} pointing to the RDKit's root directory:

mkdir build && cd build
cmake -DRDKit_INCLUDE_DIRS=${RDBASE}/Code -DRDKit_LIBRARY_DIRS=${RDBASE}/lib ..
make install clean

If you don't want to build the Python bindings add the -DBUILD_PYTHON_BINDINGS=OFF flag to the CMake commands. Otherwise you probably want to add ${LEADD}/lib to your ${PYTHONPATH} environment variable, for instance in your .bashrc file.

export PYTHONPATH="${PYTHONPATH}:${LEADD}/lib"

Example workflow

If you'd like to follow along you can find most referenced files in the example directory. Most executables are parallelized with OpenMP and will by default use all cores of your CPU. You can change this behaviour by setting the OMP_NUM_THREADS environment variable before running the command.

The first step is to create the fragments' SQLite3 database. In this example we will fragment PGK1_ligands.smi, splitting acyclic regions into atomic fragments and defining connectors with MMFF94 atom types.

cd ${LEADD}/example
${LEADD}/bin/Fragment -i PGK1_ligands.smi -o fragments.db -s fragmentation_settings.txt

Thereafter, we precompute which fragments are compatible with each connector and store the output in a file.

${LEADD}/bin/PrecalculateConnectionQueryResults -i fragments.db -o fragments.cqr -t reconstruction_settings.txt

If you want to enable guided evolution you will need to generate a fragments similarity matrix. If your fragments database is very large, think twice before creating this matrix as it can be very expensive!

${LEADD}/bin/MakeFragmentSimilarityMatrix -i fragments.db -o fragments.h5

If you would like to use the SAScore filter or score heuristic you will also need a feature count library. For this you can use any collection of drug-like molecules, for instance CHEMBL.

${LEADD}/bin/MakeFeatureLibrary -i CHEMBL.smi -o CHEMBL.fl

Now you are ready to run LEADD itself. Make sure the absolute paths to the files we just created are within your reconstruction_settings.txt file. You'll have to adapt the provided settings file a bit.

If you are using LEADD programatically with the Python bindings make sure your Python interpreter can find the LEADD shared library, for instance by adding ${LEADD}/lib to your ${PYTHONPATH}. The provided example script will attempt to rediscover one of the ligands we just fragmented:

python ${LEADD}/example/leadd_example.py reconstruction_settings.txt . -v

Alternatively, you can run LEADD as a standalone executable. Normally this requires coupling a scoring function. For this example we can use the included topological similarity scoring function to rediscover the same ligand:

${LEADD}/bin/StandaloneLEADD -s reconstruction_settings.txt -o . -v

Remember that LEADD is a stochastic algorithm. Try running it a couple of times to see what happens.

Settings

This section covers how you can configure LEADD through its settings files. For an executable's options run it from the command line with the --help flag.

The settings files are plain text files where each line consists of a KEY VALUE pair separated by a single space. Most KEYs are optional. If an optional KEY is omitted or if the VALUE is blank LEADD will use a default VALUE instead (see source code). If the KEY is required by the executable LEADD will throw an error. Unless otherwise specified, a KEY should be assumed to be optional. It's recommended to specify all KEYs, including optional ones, to avoid unexpected behaviour. Empty lines and lines starting with # aren't parsed.

Fragmentation settings

A template fragmentation settings file is provided with this distribution. For default setting values users should reference the source code.

KeyTypeValid valuesDescription
ATOM_TYPING_SCHEMEEnumDUMMY, ATOMIC_NUMBER, MMFF, MORGAN, HASHED_MORGANAtom typing scheme used to define connections
SYSTEMATIC_FRAGMENTATION_SCHEMEEnumNONE, LINEAR, SUBGRAPH, CIRCULARFragmentation scheme for molecular regions treated systematically (normally acyclic regions)
MIN_FRAGMENT_SIZEInteger>= 0Minimum systematic fragment size (in number of heavy atoms). Only used when SYSTEMATIC_FRAGMENTATION_SCHEME != NONE
MAX_FRAGMENT_SIZEInteger>= 0 && >= MIN_FRAGMENT_SIZEMaximum systematic fragment size (in number of heavy atoms). Only used when SYSTEMATIC_FRAGMENTATION_SCHEME != NONE
FRAGMENT_RINGSBoolean0 or 1Flag to fragment cyclic regions systematically. For testing purposes only

Morgan atom typing settings

KeyTypeValid valuesDescription
MORGAN_RADIUSInteger>= 0Circular atomic environment radius used to define Morgan atom types. Only used when ATOM_TYPING_SCHEME == MORGAN or ATOM_TYPING_SCHEME == HASHED_MORGAN
MORGAN_CONSIDER_CHIRALITYBoolean0 or 1Flag to use atom chirality as atomic invariant during Morgan atom type calculation. Only used when ATOM_TYPING_SCHEME == MORGAN or ATOM_TYPING_SCHEME == HASHED_MORGAN
HASHED_MORGAN_N_BITSInteger> 0Modulo used to collapse raw Morgan feature IDs into Morgan atom types. Only used when ATOM_TYPING_SCHEME == HASHED_MORGAN

Reconstruction settings

A template reconstruction settings file is provided with this distribution. For default setting values users should reference the source code.

Input file settings

These settings are required by LEADD.

KeyTypeValid valuesDescription
FRAGMENT_DATABASE_FILEStringAbsolute pathsPath to the SQLite3 database generated by Fragment
CONNECTION_QUERY_RESULTS_FILEStringAbsolute pathsPath to the .cqr generated by PrecalculateConnectionQueryResults

Fragment sampling settings

KeyTypeValid valuesDescription
ACYCLIC_FREQUENCY_GAMMAFloatAnyExponent applied to acyclic fragment frequencies when calculating their sampling weight
ACYCLIC_LEVEL_GAMMAFloatAnyExponent applied to acyclic fragment sizes when calculating their sampling weight
RING_FREQUENCY_GAMMAFloatAnyExponent applied to cyclic fragment frequencies when calculating their sampling weight
RING_LEVEL_GAMMAFloatAnyExponent applied to cyclic fragment sizes when calculating their sampling weight
SCORE_GAMMAFloatAnyExponent applied to molecules' scores when calculating fragment sampling weights in transfections and parent sampling weights

Genetic operation probabilities

KeyTypeValid valuesDescription
PERIPHERAL_EXPANSION_PROBABILITYInteger>= 0Sampling weight of the peripheral expansion operator
PERIPHERAL_DELETION_PROBABILITYInteger>= 0Sampling weight of the peripheral deletion operator
PERIPHERAL_SUBSTITUTION_PROBABILITYInteger>= 0Sampling weight of the peripheral substitution operator
PERIPHERAL_TRANSFECTION_PROBABILITYInteger>= 0Sampling weight of the peripheral transfection operator
INTERNAL_EXPANSION_PROBABILITYInteger>= 0Sampling weight of the internal expansion operator
INTERNAL_DELETION_PROBABILITYInteger>= 0Sampling weight of the internal deletion operator
INTERNAL_SUBSTITUTION_PROBABILITYInteger>= 0Sampling weight of the internal substitution operator
INTERNAL_TRANSFECTION_PROBABILITYInteger>= 0Sampling weight of the internal transfection operator
TRANSLATION_PROBABILITYInteger>= 0Sampling weight of the translation operator
STEREO_FLIP_PROBABILITYInteger>= 0Sampling weight of the stereo flip operator

Designed molecule settings

KeyTypeValid valuesDescription
N_RING_ATOMS_MEANInteger> 0Mean number of ring atoms in designed molecules
N_RING_ATOMS_STDEVFloat> 0Standard deviation in the number of ring atoms in designed molecules
MAX_N_RING_ATOMSInteger> 0Maximum allowed number of ring atoms
PEAK_N_RING_ATOMS_PROBABILITYFloat0 < v < 1Probability of keeping the number of ring atoms constant at N_RING_ATOMS_MEAN
MIN_SEED_SIZEInteger> 0Minimum number of heavy atoms in random starting molecules
ASSIGN_UNSPECIFIED_STEREOBoolean0 or 1Flag to assign random stereochemistry to unspecified chiral centers and stereochemical double bonds

Evolution settings

KeyTypeValid valuesDescription
PRNG_SEEDInteger>= 0Pseudo-random number generator seed. 0 is reserved for random seeds
N_SEEDSInteger> 0Number of random starting molecules
N_GENERATIONSInteger> 0Maximum number of generations
N_CHILDREN_PER_GENERATIONInteger> 0Number of children bred each generation
N_SURVIVORS_PER_GENERATIONInteger> 0Number of molecules advancing to the next generation
MAX_CHILD_SIMILARITYFloat0 < v < 1Maximum topological similarity between any two molecules of the population
TERMINATION_SCOREFloat> 0Population score at which the process terminates
AVERAGE_SCORE_AS_TERMINATION_CRITERIONBoolean0 or 1Flag to use the average (instead of best) molecule score as TERMINATION_SCORE
MAX_GENERATIONS_STUCKInteger>= 0Maximum number of generations without improvement in the TERMINATION_SCORE
MAX_ATTEMPTS_PER_GENERATIONInteger>= 0Maximum number of attempts at evolving a molecule before it's skipped

Synthetic accessibility settings

The SAScore filter will be employed if the FEATURE_LIBRARY_FILE is specified and MAX_SASCORE < 10. The SAScore heuristic will be employed if FEATURE_LIBRARY_FILE is specified and USE_SASCORE_HEURISTIC == 1. Both can be used simultaneously.

KeyTypeValid valuesDescription
FEATURE_LIBRARY_FILEStringAbsolute pathsPath to the .fl generated by MakeFeatureLibrary
MAX_SASCOREFloat0 < v <= 10Maximum SAScore enforced by the SAScore filter
SASCORE_HEURISTIC_MUFloat> 0Parameter of the SAScore heuristic Gaussian function
SASCORE_HEURISTIC_SIGMAFloat> 0Parameter of the SAScore heuristic Gaussian function
USE_SASCORE_HEURISTICBoolean0 or 1Flag to enable the SAScore heuristic

Guided evolution settings

Guided evolution will be enabled when SIMILARITY_MATRIX_FILE is specified and at least one of ACYCLIC_POSITIVE_REINFORCEMENT, ACYCLIC_NEGATIVE_REINFORCEMENT, RING_POSITIVE_REINFORCEMENT or RING_NEGATIVE_REINFORCEMENT are > 0.

KeyTypeValid valuesDescription
SIMILARITY_MATRIX_FILEStringAbsolute pathsPath to the HDF5 file generated by MakeFragmentSimilarityMatrix
ACYCLIC_LEARNING_SIMILARITY_THRESHOLDFloat>= 0Minimum topological similarity between two acyclic fragments for one to trigger weight adjustments of the other
RING_LEARNING_SIMILARITY_THRESHOLDFloat>= 0Minimum topological similarity between two cyclic fragments for one to trigger weight adjustments of the other
ACYCLIC_POSITIVE_REINFORCEMENTFloat>= 0Reinforcement factor when increasing acyclic fragments sampling weights
ACYCLIC_NEGATIVE_REINFORCEMENTFloat>= 0Reinforcement factor when decreasing acyclic fragments sampling weights
RING_POSITIVE_REINFORCEMENTFloat>= 0Reinforcement factor when increasing cyclic fragments sampling weights
RING_NEGATIVE_REINFORCEMENTFloat>= 0Reinforcement factor when decreasing cyclic fragments sampling weights

Restart settings

By default LEADD generates a random starting population. Alternatively, users may specify their own starting populations. This can be done in two ways:

Restart .rst files can be generated either by LEADD itself (if RESTART_OUTPUT_FILE is specified), by the ConvertToReconstruction executable or programatically with the MakePopulationFromSMILES function.

However, the user should be aware of the limitations of the latter two approaches. Since LEADD operates with meta-graphs, which are more information rich than regular molecular graphs, some compromises had to be made:

KeyTypeValid valuesDescription
RESTART_INPUT_FILEStringAbsolute pathsPath to an input .rst file generated either by ConvertToReconstruction or LEADD
RESET_WEIGHTS_ON_RESTARTBoolean0 or 1Flag to reset connector weight arrays when restarting LEADD
RESTART_OUTPUT_FILEStringAbsolute pathsPath to an output .rst file written by LEADD
N_GENERATIONS_PER_SAVEInteger> 0Number of generations between writing a .rst file

Scoring settings

LEADD comes out of the box with a topological similarity scoring function which will be employed if only TEMPLATE_MOL_SMILES is specified. This scoring function should only be used for testing purposes. For real drug discovery projects you should couple your own scoring function. The recommended way of doing this is programatically. A Python example of this can be found in this repository.

As a legacy way, you can also use the StandaloneLEADD executable. In this case, you must specify in the settings file SCORING_FUNCTION_INPUT_FILE, SCORING_FUNCTION_CALL and SCORES_OUTPUT_FILE, and make sure that your scoring function reads and writes files with the correct format. I strongly discourage this use because it can put some strain on the file system. More importantly, the scoring function is evaluated as a shell command as is, posing a significant security risk! The user is responsible for making sure this isn't abused.

KeyTypeValid valuesDescription
SCORING_FUNCTION_INPUT_FILEStringAbsolute pathsPath to the space-tabulated SMILES file containing SMILES-ID pairs written by StandaloneLEADD and read by SCORING_FUNCTION_CALL
SCORING_FUNCTION_CALLStringAnyCommand ran by StandaloneLEADD to convert SCORING_FUNCTION_INPUT_FILE into SCORES_OUTPUT_FILE
SCORES_OUTPUT_FILEStringAbsolute pathsPath to the space-tabulated file containing ID-score pairs generated by SCORING_FUNCTION_CALL and read by StandaloneLEADD
TEMPLATE_MOL_SMILESStringSMILESSMILES of the reference molecule used by LEADD's topological similarity scoring function
SCORE_FIRST_POPULATIONBoolean0 or 1Flag to score the randomly generated/read population

Reporting settings

LEADD can write out some statistics to CSVs and generate SVGs of the molecules over time. These settings are intended mostly for debugging purposes.

KeyTypeValid valuesDescription
WRITE_REPORTBoolean0 or 1Flag to write out evolution statistics to CSVs
MONITOR_BEST_MOLECULEBoolean0 or 1Flag to write out the top molecule's connector weight arrays to CSV files and generate SVGs of its molecular graph

FAQ

Does LEADD support all types of molecules?

LEADD was designed first and foremost with drug-like molecules in mind, and may or may not work well with:

Can LEADD design ring systems?

LEADD's genetic operators work correctly with existing ring systems but won't create new ones. This is an arbitrary limitation due to the complexity of designing reasonable ring systems. If you are interested in ring design you could incorporate a suitable cyclization operator.

Why do I need to specify the number of ring atoms in my designed molecules?

If you don't fragment ring systems (recommended) and you aren't using atomic fragments, LEADD's fragmentation scheme causes an overrepresentation of acyclic fragments. This could cause the design of very flexible non-drug-like molecules. As a solution LEADD samples acyclic and cyclic fragments separately. The user settings determine the likelihood of sampling each type of fragment given the current number of ring atoms. However, this is just a probabilistic model, and the number of ring atoms in your designed molecules will deviate from it should that lead to better fitness values. If you don't have a rough idea of how cyclic your designed molecules should be you can specify a very broad range. You will probably have to compensate for this by increasing the number of generations/population size.

What should I consider when selecting input molecules to create the fragments database?

Can I use a 3D scoring function?

The molecule objects used by LEADD are 2D. Naturally, you are free to generate 3D coordinates for them and score those instead. If you are going to use 3D scoring functions it's recommended to enable the stereoflip operator. Please be aware that the conformations/orientations of a molecule can change significantly between genetic operations, which also may have implications during guided evolution.

Troubleshooting

If you have any problems, questions or suggestions please open a GitHub Issue/Discussion.