MTransformer

Materials Transformers

Citation: Fu, Nihang, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, and Jianjun Hu. "Material transformers: deep learning language models for generative materials design." Machine Learning: Science and Technology 4, no. 1 (2023): 015001.

by <a href="http://mleg.cse.sc.edu" target="_blank">Machine Learning and Evolution Laboratory</a>, University of South Carolina

Benchmark Datasets for training inorganic materials composition transformers

ICSD-mix dataset (52317 samples)

ICSD-pure dataset (39431 samples)

Hybrid-mix dataset (418983 samples)

Hybrid-pure dataset (257138 samples)

Hybrid-strict dataset (212778 samples)

All of the above datasets can be downloaded from Figshare.

Trained Materials Transformer Models

| Model | ICSD-mix | ICSD-pure | Hybrid-mix | Hybrid-pure | Hybrid-strict |
|---|---|---|---|---|---|
| MT-GPT | GPT-Im | GPT-Ip | GPT-Hm | GPT-Hp | GPT-Hs |
| MT-GPT2 | GPT2-Im | GPT2-Ip | GPT2-Hm | GPT2-Hp | GPT2-Hs |
| MT-GPTJ | GPTJ-Im | GPTJ-Ip | GPTJ-Hm | GPTJ-Hp | GPTJ-Hs |
| MT-GPTNeo | GPTNeo-Im | GPTNeo-Ip | GPTNeo-Hm | GPTNeo-Hp | GPTNeo-Hs |
| MT-BART | BART-Im | BART-Ip | BART-Hm | BART-Hp | BART-Hs |
| MT-RoBERTa | RoBERTa-Im | RoBERTa-Ip | RoBERTa-Hm | RoBERTa-Hp | RoBERTa-Hs |
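The short model names combine a model family with a two-letter dataset suffix (Im, Ip, Hm, Hp, Hs, matching the column order of the table). The sketch below resolves a short name to a model folder, assuming the folder names shown in the directory tree under Data preparation. The helper `model_subpath` is hypothetical and is not part of this repository.

```python
# Dataset suffixes from the table columns, mapped to the dataset folder
# names used in the directory tree of this README.
DATASET_SUFFIXES = {
    "Im": "icsd_mix",
    "Ip": "icsd_pure",
    "Hm": "hy_mix",
    "Hp": "hy_pure",
    "Hs": "hy_strict",
}

# Model families from the table rows, mapped to the model folder names
# in the directory tree (assumed to correspond one to one).
FAMILY_FOLDERS = {
    "GPT": "MT_GPT",
    "GPT2": "MT_GPT2",
    "GPTJ": "MT_GPTJ",
    "GPTNeo": "MT_GPTNeo",
    "BART": "MT_Bart",
    "RoBERTa": "MT_RoBERTa",
}


def model_subpath(short_name: str) -> str:
    """Turn a short name such as 'GPT2-Hm' into 'MT_GPT2/hy_mix'."""
    family, _, suffix = short_name.rpartition("-")
    if family not in FAMILY_FOLDERS or suffix not in DATASET_SUFFIXES:
        raise ValueError(f"unrecognized model name: {short_name!r}")
    return f"{FAMILY_FOLDERS[family]}/{DATASET_SUFFIXES[suffix]}"
```

For example, `model_subpath("BART-Hs")` gives the folder for the MT-BART model trained on Hybrid-strict, relative to the models directory.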

How to train with your own dataset

Installation

  1. Create a conda (or other virtual) environment.
  2. Install the basic packages:
pip install -r requirements.txt
  3. Install PyTorch from the PyTorch website, choosing the build that matches your Python and CUDA versions.
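After installation, it can help to confirm which Python and PyTorch versions the environment actually sees. The following is a minimal sketch; the function name `describe_environment` is made up for illustration, and the script runs even if PyTorch is not installed yet.

```python
import importlib.util
import platform


def describe_environment() -> dict:
    """Report the Python version and, if importable, the PyTorch and CUDA versions."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    # Only import torch when it is actually installed in this environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        # torch.version.cuda is None for CPU-only builds.
        if torch.cuda.is_available():
            info["cuda"] = torch.version.cuda
    return info


print(describe_environment())
```

If `cuda` is reported as `None` on a machine with a GPU, the installed PyTorch build probably does not match the local CUDA setup.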

Data preparation

Download the datasets from the Figshare link above and unzip them under the MT_dataset folder. The directory layout should then be:

MTransformer
   ├── MT_dataset
       ├── hy_mix
           ├── test.txt
           ├── train.txt
           ├── valid.txt
       ├── hy_pure
       ├── hy_strict
       ├── icsd_mix
       ├── icsd_pure
       ├── mp
   ├── MT_models
       ├── MT_Bart
           ├── hy_mix
               ├── config.json
               ├── pytorch_model.bin
               ├── training_args.bin
           ├── hy_pure
           ├── hy_strict
           ├── icsd_mix
           ├── icsd_pure
       ├── MT_GPT
       ├── MT_GPT2
       ├── MT_GPTJ
       ├── MT_GPTNeo
       ├── MT_RoBERTa
       ├── tokenizer
           ├── vocab.txt       
   ├── generateFormula_random.py
   ├── multi_generateFormula_random.py
   ├── README.md
   └── requirements.txt
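A quick way to confirm that the datasets were unpacked as expected is to check for the split files. This sketch assumes every dataset folder contains the same three files shown for hy_mix (the tree above lists them only for that folder), and it skips the mp folder because its contents are not described here. The function `missing_dataset_files` is a hypothetical helper.

```python
from pathlib import Path

# Dataset folders and split files as listed in the directory tree.
DATASETS = ["hy_mix", "hy_pure", "hy_strict", "icsd_mix", "icsd_pure"]
SPLITS = ["train.txt", "valid.txt", "test.txt"]


def missing_dataset_files(dataset_root) -> list:
    """Return the relative paths of expected split files that are absent."""
    root = Path(dataset_root)
    missing = []
    for name in DATASETS:
        for split in SPLITS:
            if not (root / name / split).is_file():
                missing.append(f"{name}/{split}")
    return missing
```

Calling `missing_dataset_files("MT_dataset")` from the repository root returns an empty list when all fifteen files are in place.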

Training

For example, to train an MT-GPT model on the Hybrid-mix dataset:

python ./MT_model/MT_GPT/train_GPT.py  --tokenizer ./MT_model/tokenizer/   --train_data  ./MT_dataset/hy_mix/train.txt  --valid_data ./MT_dataset/hy_mix/valid.txt  --output_dir ./output

The other models are trained in the same way as MT-GPT.
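When training on several datasets, the command can be assembled programmatically. The sketch below only builds the argument list using the four flags shown in the example command; `build_train_command` is a hypothetical helper. The training script path is passed in because this README names only `train_GPT.py`, so the script names for the other models are left to the caller.

```python
import shlex
from pathlib import Path


def build_train_command(script, tokenizer_dir, dataset_dir, output_dir) -> list:
    """Assemble the training command as an argument list for subprocess.run."""
    dataset_dir = Path(dataset_dir)
    return [
        "python", str(script),
        "--tokenizer", str(tokenizer_dir),
        "--train_data", (dataset_dir / "train.txt").as_posix(),
        "--valid_data", (dataset_dir / "valid.txt").as_posix(),
        "--output_dir", str(output_dir),
    ]


# Example paths; adjust them to match your checkout.
cmd = build_train_command(
    "./MT_model/MT_GPT/train_GPT.py",
    "./MT_model/tokenizer/",
    "./MT_dataset/hy_mix",
    "./output",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch training.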

How to generate new materials compositions/formulas using the trained models

Download the models from the link above, or use your own trained models, and put them into the corresponding folders.

Generate materials formulas using the trained MT-GPT model.

python generateFormula_random.py  --tokenizer ./MT_model/tokenizer  --model_name OpenAIGPTLMHeadModel  --model_path ./MT_model/MT_GPT/hy_mix

Multi-threaded generation is also provided. The default number of threads is 10; change it with the --n_thread argument.

python multi_generateFormula_random.py  --tokenizer ./tokenizer  --model_name GPT2LMHeadModel  --model_path ./MT_GPT2/hy_mix  --n_thread 5
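How `multi_generateFormula_random.py` divides its work internally is not described here. As a rough illustration of the general pattern only, the sketch below splits a target number of samples evenly across `n_thread` workers using a thread pool. `generate_batch` is a hypothetical stand-in that returns placeholder strings rather than sampling from a model.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(total: int, n_workers: int) -> list:
    """Divide `total` items into `n_workers` counts that differ by at most one."""
    base, extra = divmod(total, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def generate_batch(count: int) -> list:
    # Hypothetical stand-in for one worker sampling `count` formulas
    # from a trained model; it returns placeholders only.
    return [f"placeholder-{i}" for i in range(count)]


def generate_parallel(total: int, n_thread: int = 10, worker=generate_batch) -> list:
    """Run `worker` on each share of the work and concatenate the results."""
    counts = split_evenly(total, n_thread)
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        batches = list(pool.map(worker, counts))
    return [item for batch in batches for item in batch]
```

With `n_thread=5` and 23 requested samples, the workers receive 5, 5, 5, 4, and 4 items respectively.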