RDKit IPython Tools

by Axel Pahl

A set of tools for using the open-source cheminformatics toolkit RDKit in the Jupyter Notebook. Written for Python 3; only tested on Linux (Ubuntu 16.04) with the conda install of RDKit.

Module tools

The Mol_List class is a subclass of the Python list for holding RDKit molecule objects; it gives direct access to much of the RDKit functionality. It is meant to be used in the Jupyter Notebook and includes, among others:

Other functions in the tools module:

...plus many others.

Module pipeline

A pipelining workflow using Python generators, mainly for RDKit and large compound sets. Because generators hand over one record at a time, arbitrarily large data sets can be processed while memory usage stays low at any given time.

Example use:

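>>> from rdkit_ipynb_tools import pipeline as p
>>> s = p.Summary()
>>> # read the records from a gzipped CSV file
>>> rd = p.start_csv_reader("test_data_b64.csv.gz", summary=s)
>>> # decode the Base64-encoded molecules
>>> b64 = p.pipe_mol_from_b64(rd, summary=s)
>>> # apply a molecule filter with the given query structure
>>> filt = p.pipe_mol_filter(b64, "[H]c2c([H])c1ncoc1c([H])c2C(N)=O", summary=s)
>>> # write the remaining records to an SD file
>>> p.stop_sdf_writer(filt, "test.sdf", summary=s)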

or, using the pipe function:

>>> s = p.Summary()
>>> rd = p.start_sdf_reader("test.sdf", summary=s)
>>> p.pipe(rd,
...        p.pipe_keep_largest_fragment,
...        (p.pipe_neutralize_mol, {"summary": s}),
...        (p.pipe_keep_props, ["Ordernumber", "NP_Score"]),
...        (p.stop_csv_writer, "test.csv", {"summary": s})
...       )

The progress of the pipeline is displayed as an HTML table in the Notebook and can also be followed in a separate terminal with the command: watch -n 2 cat pipeline.log
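The start / pipe / stop pattern used in the examples above can be illustrated with plain Python generators. The following is a minimal sketch, not the module's actual implementation: the names start_records, pipe_min_weight, pipe_keep_keys, stop_collect and compose are hypothetical, records are plain dicts rather than RDKit molecules, and the convention that a trailing dict in a tuple holds keyword arguments is an assumption modelled on the pipe example.

```python
def start_records(rows):
    """Starting component (hypothetical): yield one record dict per input row."""
    for row in rows:
        yield dict(row)


def pipe_min_weight(stream, min_weight):
    """Running component (hypothetical): pass on records at or above min_weight."""
    for rec in stream:
        if rec["weight"] >= min_weight:
            yield rec


def pipe_keep_keys(stream, keys):
    """Running component (hypothetical): drop all fields not listed in keys."""
    for rec in stream:
        yield {k: rec[k] for k in keys if k in rec}


def stop_collect(stream):
    """Stopping component (hypothetical): consume the stream into a list."""
    return list(stream)


def compose(stream, *components):
    """Chain components; each is a callable or a tuple (callable, args..., [kwargs dict])."""
    for comp in components:
        if callable(comp):
            stream = comp(stream)
            continue
        func, *args = comp
        # Assumption: a trailing dict in the tuple is treated as keyword arguments.
        kwargs = args.pop() if args and isinstance(args[-1], dict) else {}
        stream = func(stream, *args, **kwargs)
    return stream


rows = [
    {"id": 1, "weight": 180.2, "note": "a"},
    {"id": 2, "weight": 95.1, "note": "b"},
    {"id": 3, "weight": 250.0, "note": "c"},
]
result = compose(
    start_records(rows),
    (pipe_min_weight, {"min_weight": 100.0}),
    (pipe_keep_keys, ["id", "weight"]),
    stop_collect,
)
print(result)
```

Nothing is computed until the stopping component starts pulling records, and each record travels through the whole chain before the next one is read. Only the stopping component decides how much is kept in memory.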

Currently Available Pipeline Components:

| Starting | Running | Stopping |
|---|---|---|
| start_cache_reader | pipe_calc_props | stop_cache_writer |
| start_csv_reader | pipe_custom_filter | stop_count_records |
| start_mol_csv_reader | pipe_custom_man | stop_csv_writer |
| start_sdf_reader | pipe_do_nothing | stop_df_from_stream |
| start_stream_from_dict | pipe_has_prop_filter | stop_dict_from_stream |
| start_stream_from_mol_list | pipe_id_filter | stop_mol_list_from_stream |
| | pipe_inspect_stream | stop_sdf_writer |
| | pipe_join_data_from_file | |
| | pipe_keep_largest_fragment | |
| | pipe_keep_props | |
| | pipe_merge_data | |
| | pipe_mol_filter | |
| | pipe_mol_from_b64 | |
| | pipe_mol_from_smiles | |
| | pipe_mol_to_b64 | |
| | pipe_mol_to_smiles | |
| | pipe_neutralize_mol | |
| | pipe_remove_props | |
| | pipe_rename_prop | |
| | pipe_sim_filter | |
| | pipe_sleep | |

Limitation: unlike in other pipelining tools, the pipeline cannot be branched, because a Python generator can only be consumed once.
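The limitation follows from generators being single-pass iterators: once one consumer has read a record, it is gone for every other consumer. A minimal sketch in plain Python (the function numbers is a stand-in for a pipeline stream and not part of the pipeline module):

```python
def numbers():
    """Stand-in for a pipeline stream."""
    yield from (1, 2, 3, 4)


stream = numbers()
evens = [n for n in stream if n % 2 == 0]  # first branch consumes the whole stream
odds = [n for n in stream if n % 2 == 1]   # second branch finds it already exhausted

print(evens)  # [2, 4]
print(odds)   # []
```

To process the same data in two different ways, one simple option is to run two separate pipelines, each with its own starting component.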

Other Modules

Clustering

Fully usable, but the documentation still needs to be written. Please refer to the docstrings until then.

Scaffolds

New, WIP, not usable. Has been moved to the scaffolds branch.

Tutorial

Much of the functionality is shown in the tools tutorial notebook. SAR functionality is shown in the SAR tutorial notebook. The SAR module is new and Work in Progress.

Documentation

The module documentation can be built with Sphinx using the make_doc.sh script.

Installation

Requirements

The recommended way to use this project is via conda.

  1. Python 3
  2. RDKit
  3. Jupyter Notebook
  4. ipywidgets

Highly recommended

  1. cairo (via conda or pip) and cairocffi (only via pip) to get decent-looking structures
  2. Bokeh for high-quality data plots with structure tooltips

After installing the requirements, clone this repo. The rdkit_ipynb_tools can then be used by including the project's base directory (rdkit_ipynb_tools) in Python's import path (I actually prefer this to using setuptools, because a simple git pull will get you the newest version). This can be achieved by one of the following:
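One possible way to do this from within a notebook or script is to extend sys.path at runtime. This is only a sketch and not necessarily one of the options the author had in mind; the clone location ~/dev/rdkit_ipynb_tools is a hypothetical path and has to be replaced with the actual one.

```python
import os
import sys

# Hypothetical clone location; adjust to wherever the repo was cloned.
repo_dir = os.path.expanduser("~/dev/rdkit_ipynb_tools")

# Put the project's base directory on the import path if it is not there already.
if repo_dir not in sys.path:
    sys.path.insert(0, repo_dir)

# With the path in place (and the requirements installed), imports such as
#     from rdkit_ipynb_tools import tools, pipeline
# should then resolve to the cloned sources.
```

Because the sources are imported directly from the clone, a git pull is enough to pick up a newer version in the next session.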

Tips & Tricks

Pipelines, Structures and Performance

Processing data from 200k compounds takes 10-15 seconds on my notebook computer.

Substructure searches take longer.

For performance reasons, I store the molecule structures in text format as Base64-encoded pickle strings of the mol objects (see also Greg's blog post for faster structure generation):

b64encode(pickle.dumps(mol)).decode()

For me, that has proven to be the fastest method when dealing with flat text files and is also the reason why there are pipe_mol_to_b64 and pipe_mol_from_b64 components in the pipeline module.
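The round trip behind these two components can be sketched with the standard library alone. The helper names obj_to_b64 and b64_to_obj are made up for this example, and a plain dict stands in for the molecule so that the snippet runs without RDKit; with RDKit installed, a mol object could be passed instead, since RDKit molecules support pickling.

```python
import pickle
from base64 import b64decode, b64encode


def obj_to_b64(obj):
    """Serialize an object to a single-line ASCII string (hypothetical helper)."""
    return b64encode(pickle.dumps(obj)).decode()


def b64_to_obj(text):
    """Restore an object from its Base64-encoded pickle string (hypothetical helper).

    Only unpickle data from sources you trust.
    """
    return pickle.loads(b64decode(text))


# A dict stands in for an RDKit mol object in this sketch.
mol_stand_in = {"smiles": "c1ccccc1", "name": "benzene"}

encoded = obj_to_b64(mol_stand_in)
restored = b64_to_obj(encoded)

print(restored == mol_stand_in)  # True
```

The encoded string contains no line breaks or separators such as commas or tabs, which is what makes it convenient as a single column in a flat text file.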

Working Offline

Roadmap

(probably not in this order)