t-SMILES: A Scalable Fragment-based Molecular Representation Framework
When using advanced NLP methodologies to solve chemical problems, two fundamental questions arise: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'?
This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES) to address the second question. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
For more details, please refer to the papers.
TSSA, TSDY, TSID: https://www.nature.com/articles/s41467-024-49388-6
TSIS (TSIS, TSISD, TSISO, TSISR): https://arxiv.org/abs/2402.02164
Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show that:
- It can build a multi-code molecular description system in which the various descriptions complement each other, enhancing overall performance. Under this framework, classical SMILES can be unified as a special case of t-SMILES, achieving better-balanced performance with hybrid decomposition algorithms.
- It exhibits impressive performance on the low-resource datasets JNK3 and AID1706, whether the model is original, data-augmented, or pre-trained and fine-tuned.
- It significantly outperforms classical SMILES, DeepSMILES, SELFIES, and baseline models in goal-directed tasks.
- It outperforms previous fragment-based models and is competitive with classical SMILES and graph-based methods on ZINC, QM9, and ChEMBL.
To support the t-SMILES algorithm, we introduce a new character, '&', which acts as a tree node when the node in the full binary tree (FBT) is not a real fragment. We also introduce a second new character, '^', which separates two adjacent substructure segments in a t-SMILES string, much as a blank space separates two words in an English sentence.
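The roles of '&' and '^' can be illustrated with a small tokenizer. This is a minimal sketch, not part of the released code: the names `TOKEN_PATTERN` and `tokenize` are hypothetical, and the input is the TSID_M string for Celecoxib given in this README. It performs a lexical split only and does not rebuild the tree or the molecule.

```python
import re

# TSID_M code of Celecoxib, copied from this README.
TSID_M = (
    "[1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&"
    "[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&"
)

# A token is either one of the two special characters or a maximal run of
# any other characters (a fragment written in SMILES-like syntax).
TOKEN_PATTERN = re.compile(r"[&^]|[^&^]+")


def tokenize(t_smiles):
    """Split a t-SMILES string into fragment, '&', and '^' tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(t_smiles)


tokens = tokenize(TSID_M)
fragments = [t for t in tokens if t not in ("&", "^")]

print(fragments)
print("'&' tokens:", tokens.count("&"), "| '^' tokens:", tokens.count("^"))
```

Because every character ends up in exactly one token, joining the tokens gives back the original string, which makes the split easy to sanity-check.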
Four coding algorithms are presented in these studies:
- TSSA: t-SMILES with shared atom.
- TSDY: t-SMILES with dummy atom but without ID.
- TSID: t-SMILES with ID and dummy atom.
- TSIS: Simplified TSID, including TSIS, TSISD, TSISO, TSISR.
For example, the six t-SMILES codes of Celecoxib are:
TSID_M:
- [1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&
TSDY_M (replace [n*] with *):
- *C&*C1=CC=C(*)C=C1&*C1=CC(*)=NN1*&*C(*)(F)F&*F^*C1=CC=C(*)C=C1&&*S(N)(=O)=O&&&
TSSA_M:
- CC&C1=CC=CC=C1&CC&C1=C[NH]N=C1&CN&C1=CC=CC=C1^CC^CS&C&N[SH]=O&CF&&&&FCF&&
TSIS_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O
TSISD_M:
- [1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^[3*]C([4*])(F)F^[4*]F^[5*]C1=CC=C([6*])C=C1^[6*]S(N)(=O)=O
TSISO_M:
- [2*]C1=CC([3*])=NN1[5*]^[1*]C1=CC=C([2*])C=C1^[5*]C1=CC=C([6*])C=C1^[3*]C([4*])(F)F^[6*]S(N)(=O)=O^[1*]C^[4*]F
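Two relationships among the codes above can be checked directly on the strings. The following is a minimal sketch with hypothetical helper names (`tsid_to_tsdy`, `fragment_counts`), not the repository's encoder: it replaces each numbered dummy atom `[n*]` with `*` to go from TSID_M to TSDY_M, and it compares the fragment contents of the TSIS-family strings, which for this molecule differ only in fragment order.

```python
import re
from collections import Counter

# Codes of Celecoxib, copied from the examples above.
TSID_M = ("[1*]C&[1*]C1=CC=C([2*])C=C1&[2*]C1=CC([3*])=NN1[5*]&"
          "[3*]C([4*])(F)F&[4*]F^[5*]C1=CC=C([6*])C=C1&&[6*]S(N)(=O)=O&&&")
TSDY_M = ("*C&*C1=CC=C(*)C=C1&*C1=CC(*)=NN1*&"
          "*C(*)(F)F&*F^*C1=CC=C(*)C=C1&&*S(N)(=O)=O&&&")
TSIS_M = ("[1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^"
          "[3*]C([4*])(F)F^[5*]C1=CC=C([6*])C=C1^[4*]F^[6*]S(N)(=O)=O")
TSISD_M = ("[1*]C^[1*]C1=CC=C([2*])C=C1^[2*]C1=CC([3*])=NN1[5*]^"
           "[3*]C([4*])(F)F^[4*]F^[5*]C1=CC=C([6*])C=C1^[6*]S(N)(=O)=O")
TSISO_M = ("[2*]C1=CC([3*])=NN1[5*]^[1*]C1=CC=C([2*])C=C1^"
           "[5*]C1=CC=C([6*])C=C1^[3*]C([4*])(F)F^[6*]S(N)(=O)=O^[1*]C^[4*]F")


def tsid_to_tsdy(tsid):
    """Drop the numeric IDs from dummy atoms: '[3*]' becomes '*' (hypothetical helper)."""
    return re.sub(r"\[\d+\*\]", "*", tsid)


def fragment_counts(t_smiles):
    """Count the fragments in a string, ignoring '&' and '^' (hypothetical helper)."""
    return Counter(part for part in re.split(r"[&^]", t_smiles) if part)


print(tsid_to_tsdy(TSID_M) == TSDY_M)
print(fragment_counts(TSIS_M) == fragment_counts(TSISD_M) == fragment_counts(TSISO_M))
```

The conversion only goes one way: once the IDs are removed, the pairing of dummy atoms can no longer be read off the string itself.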
Here we provide the source code of our method.
Dependencies
We recommend using Anaconda to manage the Python version and installed packages.
Please make sure the following packages are installed:
- Python (version >= 3.7)
- PyTorch (version == 1.7)
- RDKit (version >= 2020.03)
- Networkx (version >= 2.4)
- Numpy (version >= 1.19)
- Pandas (version >= 1.2.2)
- Matplotlib (version >= 2.0)
- Scipy (version >= 1.4.1)
Datamol and rBRICS must be obtained separately: download them from https://github.com/datamol-io/datamol and https://github.com/BiomedSciAI/r-BRICS and copy them into the MolUtils folder.
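The version requirements listed above can be checked with a short script before running the code. This is a minimal sketch under stated assumptions: the helper names (`version_tuple`, `satisfies`, `check_environment`) are hypothetical, each package is assumed to expose a `__version__` attribute under its usual import name, and "== 1.7" is interpreted as "any 1.7.x release".

```python
import importlib
import re
import sys

# (import name, comparison, version), taken from the dependency list above.
REQUIREMENTS = [
    ("torch", "==", "1.7"),
    ("rdkit", ">=", "2020.03"),
    ("networkx", ">=", "2.4"),
    ("numpy", ">=", "1.19"),
    ("pandas", ">=", "1.2.2"),
    ("matplotlib", ">=", "2.0"),
    ("scipy", ">=", "1.4.1"),
]


def version_tuple(version, length=3):
    """Turn a string such as '1.7.1+cu110' into (1, 7, 1), padding with zeros."""
    core = version.split("+")[0]  # drop local build tags such as '+cu110'
    numbers = [int(n) for n in re.findall(r"\d+", core)][:length]
    return tuple(numbers + [0] * (length - len(numbers)))


def satisfies(installed, op, required):
    """Return True if the installed version meets the requirement."""
    if op == "==":
        # Assumption: compare only the components the requirement spells out,
        # so '1.7.1' counts as matching '== 1.7'.
        n = len(re.findall(r"\d+", required))
        return version_tuple(installed)[:n] == version_tuple(required)[:n]
    return version_tuple(installed) >= version_tuple(required)


def check_environment():
    """Map each requirement name to True (satisfied) or False."""
    report = {"python": sys.version_info >= (3, 7)}
    for name, op, required in REQUIREMENTS:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = False
            continue
        installed = getattr(module, "__version__", "0")
        report[name] = satisfies(installed, op, required)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'missing or version mismatch'}")
```

Datamol and rBRICS are not covered by this check, since they are copied into the MolUtils folder rather than installed as packages.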
Usage
- DataSet/Graph/CNJTMol.py
encode_single()
It contains a preprocessing function that generates t-SMILES from a dataset.
- DataSet/Graph/CNJMolAssembler.py
decode_single()
It reconstructs molecules from t-SMILES to generate classical SMILES.
In this study, GPT and RNN generative models are used for evaluation.
Acknowledgement
We thank the authors of the following Git repositories, which gave us a lot of inspiration:
- MolGPT: https://github.com/devalab/molgpt
- hgraph2graph: https://github.com/wengong-jin/hgraph2graph
- DeepSmiles: https://github.com/baoilleach/deepsmiles
- AttentiveFP: https://github.com/OpenDrugAI/AttentiveFP
- Guacamol: https://github.com/BenevolentAI/guacamol_baselines
- GPT2: https://github.com/samwisegamjeee/pytorch-transformers