Awesome

LEADD

About

LEADD (Lamarckian Evolutionary Algorithm for de novo Drug Design) is a tool for molecular design and optimization.

Molecules are represented as meta-graphs of molecular fragments. Fragments are extracted by fragmenting an input virtual library, and broken bonds are converted to labelled attachment points (a.k.a. connectors). Molecules are reassembled by genetic operators that combine fragments through said connectors. Knowledge-based connectivity rules, extracted from the same input library, are enforced throughout the process. A population of molecules is evolved stochastically through use of genetic operators, with the goal of optimizing a user-provided scoring function. A Lamarckian evolutionary mechanism adjusts the reproductive behaviour of the molecules based on the outcome of the previous generation.

A more detailed description of the algorithm can be found in the corresponding paper: Kerstjens, A., De Winter, H. LEADD: Lamarckian evolutionary algorithm for de novo drug design. J Cheminform 14, 3 (2022). https://doi.org/10.1186/s13321-022-00582-y.

Authors

Alan Kerstjens (https://github.com/alankerstjens)

Dependencies

CMake (>= 3.15)
C++17 and OpenMP 4.0 compliant compiler
Boost (>= 1.73). Make sure Boost.Program_options and Boost.Serialization are part of your Boost distribution!
RDKit (>= 2021.03.2)
SQLite (>= 3.31.1)
HDF5 (>= 1.10.4)

It's recommended to also build the Python bindings. On top of the above, you will need the following:

Python (>= 3.8.8)
pybind11 (>= 2.6.2). If you clone this repository (with git clone --recurse-submodules) it should be included in the distribution.

If you are interested in re-running the benchmarks you will also need:

GuacaMol

The versions listed above have been tested to work on both Ubuntu 20.04 and Windows 10. Older versions may also work, with the exception of the RDKit due to a bug in older versions.

Installation

The following sections assume you are installing LEADD on a GNU/Linux machine. If this isn't the case you may have to adapt these instructions slightly.

Clone this repository (along with the pybind11 submodule). The resulting directory will be referred to as ${LEADD}.

git clone --recurse-submodules https://github.com/UAMCAntwerpen/LEADD.git

Install the rest of the dependencies. If you don't mind a bloated installation you can install them into an Anaconda environment using the provided leadd_conda_env.yml file.

cd ${LEADD}
conda env create -f leadd_conda_env.yml
conda activate LEADD

If you installed the RDKit through Anaconda you can build LEADD as follows:

mkdir build && cd build
cmake -DRDKit_INCLUDE_DIRS=${CONDA_PREFIX}/include/rdkit -DRDKit_LIBRARY_DIRS=${CONDA_PREFIX}/lib ..
make install clean

Alternatively, if you installed the RDKit from source you should have an environment variable ${RDBASE} pointing to the RDKit's root directory:

mkdir build && cd build
cmake -DRDKit_INCLUDE_DIRS=${RDBASE}/Code -DRDKit_LIBRARY_DIRS=${RDBASE}/lib ..
make install clean

If you don't want to build the Python bindings add the -DBUILD_PYTHON_BINDINGS=OFF flag to the CMake commands. Otherwise you probably want to add ${LEADD}/lib to your ${PYTHONPATH} environment variable, for instance in your .bashrc file.

export PYTHONPATH="${PYTHONPATH}:${LEADD}/lib"

Example workflow

If you'd like to follow along you can find most referenced files in the example directory. Most executables are parallelized with OpenMP and will by default use all cores of your CPU. You can change this behaviour by setting the OMP_NUM_THREADS environment variable before running the command.

The first step is to create the fragments' SQLite3 database. In this example we will fragment PGK1_ligands.smi, splitting acyclic regions into atomic fragments and defining connectors with MMFF94 atom types.

cd ${LEADD}/example
${LEADD}/bin/Fragment -i PGK1_ligands.smi -o fragments.db -s fragmentation_settings.txt

Thereafter, we precompute which fragments are compatible with each connector and store the output in a file.

${LEADD}/bin/PrecalculateConnectionQueryResults -i fragments.db -o fragments.cqr -t reconstruction_settings.txt

If you want to enable guided evolution you will need to generate a fragments similarity matrix. If your fragments database is very large, think twice before creating this matrix as it can be very expensive!

${LEADD}/bin/MakeFragmentSimilarityMatrix -i fragments.db -o fragments.h5

If you would like to use the SAScore filter or score heuristic you will also need a feature count library. For this you can use any collection of drug-like molecules, for instance CHEMBL.

${LEADD}/bin/MakeFeatureLibrary -i CHEMBL.smi -o CHEMBL.fl

Now you are ready to run LEADD itself. Make sure the absolute paths to the files we just created are within your reconstruction_settings.txt file. You'll have to adapt the provided settings file a bit.

If you are using LEADD programatically with the Python bindings make sure your Python interpreter can find the LEADD shared library, for instance by adding ${LEADD}/lib to your ${PYTHONPATH}. The provided example script will attempt to rediscover one of the ligands we just fragmented:

python ${LEADD}/example/leadd_example.py reconstruction_settings.txt . -v

Alternatively, you can run LEADD as a standalone executable. Normally this requires coupling a scoring function. For this example we can use the included topological similarity scoring function to rediscover the same ligand:

${LEADD}/bin/StandaloneLEADD -s reconstruction_settings.txt -o . -v

Remember that LEADD is a stochastic algorithm. Try running it a couple of times to see what happens.

Settings

This section covers how you can configure LEADD through its settings files. For an executable's options run it from the command line with the --help flag.

The settings files are plain text files where each line consists of a KEY VALUE pair separated by a single space. Most KEYs are optional. If an optional KEY is omitted or if the VALUE is blank LEADD will use a default VALUE instead (see source code). If the KEY is required by the executable LEADD will throw an error. Unless otherwise specified, a KEY should be assumed to be optional. It's recommended to specify all KEYs, including optional ones, to avoid unexpected behaviour. Empty lines and lines starting with # aren't parsed.

Fragmentation settings

A template fragmentation settings file is provided with this distribution. For default setting values users should reference the source code.

Key	Type	Valid values	Description
`ATOM_TYPING_SCHEME`	Enum	`DUMMY`, `ATOMIC_NUMBER`, `MMFF`, `MORGAN`, `HASHED_MORGAN`	Atom typing scheme used to define connections
`SYSTEMATIC_FRAGMENTATION_SCHEME`	Enum	`NONE`, `LINEAR`, `SUBGRAPH`, `CIRCULAR`	Fragmentation scheme for molecular regions treated systematically (normally acyclic regions)
`MIN_FRAGMENT_SIZE`	Integer	>= 0	Minimum systematic fragment size (in number of heavy atoms). Only used when `SYSTEMATIC_FRAGMENTATION_SCHEME != NONE`
`MAX_FRAGMENT_SIZE`	Integer	>= 0 && >= `MIN_FRAGMENT_SIZE`	Maximum systematic fragment size (in number of heavy atoms). Only used when `SYSTEMATIC_FRAGMENTATION_SCHEME != NONE`
`FRAGMENT_RINGS`	Boolean	0 or 1	Flag to fragment cyclic regions systematically. For testing purposes only

Morgan atom typing settings

Key	Type	Valid values	Description
`MORGAN_RADIUS`	Integer	>= 0	Circular atomic environment radius used to define Morgan atom types. Only used when `ATOM_TYPING_SCHEME == MORGAN` or `ATOM_TYPING_SCHEME == HASHED_MORGAN`
`MORGAN_CONSIDER_CHIRALITY`	Boolean	0 or 1	Flag to use atom chirality as atomic invariant during Morgan atom type calculation. Only used when `ATOM_TYPING_SCHEME == MORGAN` or `ATOM_TYPING_SCHEME == HASHED_MORGAN`
`HASHED_MORGAN_N_BITS`	Integer	> 0	Modulo used to collapse raw Morgan feature IDs into Morgan atom types. Only used when `ATOM_TYPING_SCHEME == HASHED_MORGAN`

Reconstruction settings

A template reconstruction settings file is provided with this distribution. For default setting values users should reference the source code.

Input file settings

These settings are required by LEADD.

Key	Type	Valid values	Description
`FRAGMENT_DATABASE_FILE`	String	Absolute paths	Path to the SQLite3 database generated by `Fragment`
`CONNECTION_QUERY_RESULTS_FILE`	String	Absolute paths	Path to the `.cqr` generated by `PrecalculateConnectionQueryResults`

Fragment sampling settings

Key	Type	Valid values	Description
`ACYCLIC_FREQUENCY_GAMMA`	Float	Any	Exponent applied to acyclic fragment frequencies when calculating their sampling weight
`ACYCLIC_LEVEL_GAMMA`	Float	Any	Exponent applied to acyclic fragment sizes when calculating their sampling weight
`RING_FREQUENCY_GAMMA`	Float	Any	Exponent applied to cyclic fragment frequencies when calculating their sampling weight
`RING_LEVEL_GAMMA`	Float	Any	Exponent applied to cyclic fragment sizes when calculating their sampling weight
`SCORE_GAMMA`	Float	Any	Exponent applied to molecules' scores when calculating fragment sampling weights in transfections and parent sampling weights

Genetic operation probabilities

Key	Type	Valid values	Description
`PERIPHERAL_EXPANSION_PROBABILITY`	Integer	>= 0	Sampling weight of the peripheral expansion operator
`PERIPHERAL_DELETION_PROBABILITY`	Integer	>= 0	Sampling weight of the peripheral deletion operator
`PERIPHERAL_SUBSTITUTION_PROBABILITY`	Integer	>= 0	Sampling weight of the peripheral substitution operator
`PERIPHERAL_TRANSFECTION_PROBABILITY`	Integer	>= 0	Sampling weight of the peripheral transfection operator
`INTERNAL_EXPANSION_PROBABILITY`	Integer	>= 0	Sampling weight of the internal expansion operator
`INTERNAL_DELETION_PROBABILITY`	Integer	>= 0	Sampling weight of the internal deletion operator
`INTERNAL_SUBSTITUTION_PROBABILITY`	Integer	>= 0	Sampling weight of the internal substitution operator
`INTERNAL_TRANSFECTION_PROBABILITY`	Integer	>= 0	Sampling weight of the internal transfection operator
`TRANSLATION_PROBABILITY`	Integer	>= 0	Sampling weight of the translation operator
`STEREO_FLIP_PROBABILITY`	Integer	>= 0	Sampling weight of the stereo flip operator

Designed molecule settings

Key	Type	Valid values	Description
`N_RING_ATOMS_MEAN`	Integer	> 0	Mean number of ring atoms in designed molecules
`N_RING_ATOMS_STDEV`	Float	> 0	Standard deviation in the number of ring atoms in designed molecules
`MAX_N_RING_ATOMS`	Integer	> 0	Maximum allowed number of ring atoms
`PEAK_N_RING_ATOMS_PROBABILITY`	Float	0 < v < 1	Probability of keeping the number of ring atoms constant at `N_RING_ATOMS_MEAN`
`MIN_SEED_SIZE`	Integer	> 0	Minimum number of heavy atoms in random starting molecules
`ASSIGN_UNSPECIFIED_STEREO`	Boolean	0 or 1	Flag to assign random stereochemistry to unspecified chiral centers and stereochemical double bonds

Evolution settings

Key	Type	Valid values	Description
`PRNG_SEED`	Integer	>= 0	Pseudo-random number generator seed. `0` is reserved for random seeds
`N_SEEDS`	Integer	> 0	Number of random starting molecules
`N_GENERATIONS`	Integer	> 0	Maximum number of generations
`N_CHILDREN_PER_GENERATION`	Integer	> 0	Number of children bred each generation
`N_SURVIVORS_PER_GENERATION`	Integer	> 0	Number of molecules advancing to the next generation
`MAX_CHILD_SIMILARITY`	Float	0 < v < 1	Maximum topological similarity between any two molecules of the population
`TERMINATION_SCORE`	Float	> 0	Population score at which the process terminates
`AVERAGE_SCORE_AS_TERMINATION_CRITERION`	Boolean	0 or 1	Flag to use the average (instead of best) molecule score as `TERMINATION_SCORE`
`MAX_GENERATIONS_STUCK`	Integer	>= 0	Maximum number of generations without improvement in the `TERMINATION_SCORE`
`MAX_ATTEMPTS_PER_GENERATION`	Integer	>= 0	Maximum number of attempts at evolving a molecule before it's skipped

Synthetic accessibility settings

The SAScore filter will be employed if the FEATURE_LIBRARY_FILE is specified and MAX_SASCORE < 10. The SAScore heuristic will be employed if FEATURE_LIBRARY_FILE is specified and USE_SASCORE_HEURISTIC == 1. Both can be used simultaneously.

Key	Type	Valid values	Description
`FEATURE_LIBRARY_FILE`	String	Absolute paths	Path to the `.fl` generated by `MakeFeatureLibrary`
`MAX_SASCORE`	Float	0 < v <= 10	Maximum SAScore enforced by the SAScore filter
`SASCORE_HEURISTIC_MU`	Float	> 0	Parameter of the SAScore heuristic Gaussian function
`SASCORE_HEURISTIC_SIGMA`	Float	> 0	Parameter of the SAScore heuristic Gaussian function
`USE_SASCORE_HEURISTIC`	Boolean	0 or 1	Flag to enable the SAScore heuristic

Guided evolution settings

Guided evolution will be enabled when SIMILARITY_MATRIX_FILE is specified and at least one of ACYCLIC_POSITIVE_REINFORCEMENT, ACYCLIC_NEGATIVE_REINFORCEMENT, RING_POSITIVE_REINFORCEMENT or RING_NEGATIVE_REINFORCEMENT are > 0.

Key	Type	Valid values	Description
`SIMILARITY_MATRIX_FILE`	String	Absolute paths	Path to the HDF5 file generated by `MakeFragmentSimilarityMatrix`
`ACYCLIC_LEARNING_SIMILARITY_THRESHOLD`	Float	>= 0	Minimum topological similarity between two acyclic fragments for one to trigger weight adjustments of the other
`RING_LEARNING_SIMILARITY_THRESHOLD`	Float	>= 0	Minimum topological similarity between two cyclic fragments for one to trigger weight adjustments of the other
`ACYCLIC_POSITIVE_REINFORCEMENT`	Float	>= 0	Reinforcement factor when increasing acyclic fragments sampling weights
`ACYCLIC_NEGATIVE_REINFORCEMENT`	Float	>= 0	Reinforcement factor when decreasing acyclic fragments sampling weights
`RING_POSITIVE_REINFORCEMENT`	Float	>= 0	Reinforcement factor when increasing cyclic fragments sampling weights
`RING_NEGATIVE_REINFORCEMENT`	Float	>= 0	Reinforcement factor when decreasing cyclic fragments sampling weights

Restart settings

By default LEADD generates a random starting population. Alternatively, users may specify their own starting populations. This can be done in two ways:

Programatically with the LEADD::SetPopulation function. This can be done with both the C++ and Python APIs.
By specifying a .rst RESTART_INPUT_FILE.

Restart .rst files can be generated either by LEADD itself (if RESTART_OUTPUT_FILE is specified), by the ConvertToReconstruction executable or programatically with the MakePopulationFromSMILES function.

However, the user should be aware of the limitations of the latter two approaches. Since LEADD operates with meta-graphs, which are more information rich than regular molecular graphs, some compromises had to be made:

Acylic regions will always be transformed into atomic fragments
The generated connectors must also be present in the fragments database. Otherwise the molecule will be skipped.
You won't be able to use guided evolution. This is because we can't guarantee that the generated fragments are present in the database nor calculate fragment similarities on the go in an efficient way.

Key	Type	Valid values	Description
`RESTART_INPUT_FILE`	String	Absolute paths	Path to an input `.rst` file generated either by `ConvertToReconstruction` or LEADD
`RESET_WEIGHTS_ON_RESTART`	Boolean	0 or 1	Flag to reset connector weight arrays when restarting LEADD
`RESTART_OUTPUT_FILE`	String	Absolute paths	Path to an output `.rst` file written by LEADD
`N_GENERATIONS_PER_SAVE`	Integer	> 0	Number of generations between writing a `.rst` file

Scoring settings

LEADD comes out of the box with a topological similarity scoring function which will be employed if only TEMPLATE_MOL_SMILES is specified. This scoring function should only be used for testing purposes. For real drug discovery projects you should couple your own scoring function. The recommended way of doing this is programatically. A Python example of this can be found in this repository.

As a legacy way, you can also use the StandaloneLEADD executable. In this case, you must specify in the settings file SCORING_FUNCTION_INPUT_FILE, SCORING_FUNCTION_CALL and SCORES_OUTPUT_FILE, and make sure that your scoring function reads and writes files with the correct format. I strongly discourage this use because it can put some strain on the file system. More importantly, the scoring function is evaluated as a shell command as is, posing a significant security risk! The user is responsible for making sure this isn't abused.

Key	Type	Valid values	Description
`SCORING_FUNCTION_INPUT_FILE`	String	Absolute paths	Path to the space-tabulated SMILES file containing SMILES-ID pairs written by `StandaloneLEADD` and read by `SCORING_FUNCTION_CALL`
`SCORING_FUNCTION_CALL`	String	Any	Command ran by `StandaloneLEADD` to convert `SCORING_FUNCTION_INPUT_FILE` into `SCORES_OUTPUT_FILE`
`SCORES_OUTPUT_FILE`	String	Absolute paths	Path to the space-tabulated file containing ID-score pairs generated by `SCORING_FUNCTION_CALL` and read by `StandaloneLEADD`
`TEMPLATE_MOL_SMILES`	String	SMILES	SMILES of the reference molecule used by LEADD's topological similarity scoring function
`SCORE_FIRST_POPULATION`	Boolean	0 or 1	Flag to score the randomly generated/read population

Reporting settings

LEADD can write out some statistics to CSVs and generate SVGs of the molecules over time. These settings are intended mostly for debugging purposes.

Key	Type	Valid values	Description
`WRITE_REPORT`	Boolean	0 or 1	Flag to write out evolution statistics to CSVs
`MONITOR_BEST_MOLECULE`	Boolean	0 or 1	Flag to write out the top molecule's connector weight arrays to CSV files and generate SVGs of its molecular graph

FAQ

Does LEADD support all types of molecules?

LEADD was designed first and foremost with drug-like molecules in mind, and may or may not work well with:

Very small or very large molecules
Non-organic elements. I've only tested molecules with H, C, N, O, S, P and halogens.

Can LEADD design ring systems?

LEADD's genetic operators work correctly with existing ring systems but won't create new ones. This is an arbitrary limitation due to the complexity of designing reasonable ring systems. If you are interested in ring design you could incorporate a suitable cyclization operator.

Why do I need to specify the number of ring atoms in my designed molecules?

If you don't fragment ring systems (recommended) and you aren't using atomic fragments, LEADD's fragmentation scheme causes an overrepresentation of acyclic fragments. This could cause the design of very flexible non-drug-like molecules. As a solution LEADD samples acyclic and cyclic fragments separately. The user settings determine the likelihood of sampling each type of fragment given the current number of ring atoms. However, this is just a probabilistic model, and the number of ring atoms in your designed molecules will deviate from it should that lead to better fitness values. If you don't have a rough idea of how cyclic your designed molecules should be you can specify a very broad range. You will probably have to compensate for this by increasing the number of generations/population size.

What should I consider when selecting input molecules to create the fragments database?

Maximize the ring diversity. Since LEADD doesn't create new ring systems this is crucial.
Avoid entirely cyclical molecules. By default these won't be fragmented and will be skipped. Note that if you decide to split them LEADD won't reassemble ring systems (see above).
Be careful when using combinatorial chemistry libraries. You could face fragment frequency imbalances.
If you intend on using guided evolution you should limit the number (N) of fragments. The time to calculate and space to store the fragments similarity matrix scale with O(N^2) complexity. LEADD's execution time and memory use will scale with O(N) complexity.

Can I use a 3D scoring function?

The molecule objects used by LEADD are 2D. Naturally, you are free to generate 3D coordinates for them and score those instead. If you are going to use 3D scoring functions it's recommended to enable the stereoflip operator. Please be aware that the conformations/orientations of a molecule can change significantly between genetic operations, which also may have implications during guided evolution.

Troubleshooting

If you have any problems, questions or suggestions please open a GitHub Issue/Discussion.