


Materials Transformers

Citation: Fu, Nihang, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, and Jianjun Hu. "Material transformers: deep learning language models for generative materials design." Machine Learning: Science and Technology 4, no. 1 (2023): 015001.

by <a href="http://mleg.cse.sc.edu" target="_blank">Machine Learning and Evolution Laboratory</a>, University of South Carolina

Benchmark Datasets for training inorganic materials composition transformers

ICSD-mix dataset (52317 samples)

ICSD-pure dataset (39431 samples)

Hybrid-mix dataset (418983 samples)

Hybrid-pure dataset (257138 samples)

Hybrid-strict dataset (212778 samples)

All of the above datasets can be downloaded from Figshare.

Trained Materials Transformer Models


How to train with your own dataset


  1. Create your own conda (or other) environment.
  2. Install the basic packages:
pip install -r requirements.txt
  3. Install PyTorch from the PyTorch website, choosing the build that matches your Python and CUDA versions.
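
After installation, it can help to confirm which Python and PyTorch builds are actually active in the environment. The following is a minimal sketch, not part of this repository; the function name `describe_environment` is hypothetical, and it only reports what it finds.

```python
import sys


def describe_environment():
    """Collect basic facts about the Python/PyTorch setup without failing if torch is absent."""
    info = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "torch_version": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    try:
        import torch  # imported lazily so the check also works before PyTorch is installed
    except ImportError:
        return info

    info["torch_version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `torch_version` is reported as `None`, PyTorch is not importable in the current environment; if `cuda_available` is `False`, training will fall back to CPU or fail, depending on how the training script selects its device.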

Data preparation

Download the datasets from the above link, then unzip them under the MT_dataset folder. Afterwards, the directory should look like this:

   ├── MT_dataset
       ├── hy_mix
           ├── test.txt
           ├── train.txt
           ├── valid.txt
       ├── hy_pure
       ├── hy_strict
       ├── icsd_mix
       ├── icsd_pure
       ├── mp
   ├── MT_models
       ├── MT_Bart
           ├── hy_mix
               ├── config.json
               ├── pytorch_model.bin
               ├── training_args.bin
           ├── hy_pure
           ├── hy_strict
           ├── icsd_mix
           ├── icsd_pure
       ├── MT_GPT
       ├── MT_GPT2
       ├── MT_GPTJ
       ├── MT_GPTNeo
       ├── MT_RoBERTa
       ├── tokenizer
           ├── vocab.txt       
   ├── generateFormula_random.py
   ├── multi_generateFormula_random.py
   ├── README.md
   └── requirements.txt
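
Before training, the layout above can be checked programmatically. This is a minimal sketch under two assumptions that the tree does not state explicitly: that every dataset folder holds the same three split files shown for hy_mix, and that only the dataset and tokenizer files need checking. The helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Names copied from the directory tree above; anything not listed there is not checked.
DATASETS = ["hy_mix", "hy_pure", "hy_strict", "icsd_mix", "icsd_pure"]
# Assumption: each dataset folder has the same splits as hy_mix.
SPLITS = ["train.txt", "valid.txt", "test.txt"]


def missing_paths(root):
    """Return the expected dataset and tokenizer files that are absent under root."""
    root = Path(root)
    expected = [root / "MT_dataset" / name / split for name in DATASETS for split in SPLITS]
    expected.append(root / "MT_models" / "tokenizer" / "vocab.txt")
    return [str(path.relative_to(root)) for path in expected if not path.is_file()]


if __name__ == "__main__":
    for path in missing_paths("."):
        print("missing:", path)
```

Run it from the repository root; an empty output means every checked file was found.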


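For example, to train an MT-GPT model on the Hybrid-mix dataset (adjust the paths if your folder names differ from those in the command, e.g. MT_dataset and MT_models in the layout above):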

python ./MT_model/MT_GPT/train_GPT.py  --tokenizer ./MT_model/tokenizer/   --train_data  ./MT_Dataset/hy_mix/train.txt  --valid_data ./MT_Dataset/hy_mix/valid.txt  --output_dir ./output

Training the other models follows the same pattern as MT-GPT.
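
To train MT-GPT on each benchmark dataset in turn, the command can be assembled in a loop. The sketch below only builds and prints the commands; it uses the flags and default paths from the example command, while the function name `build_train_command` and the per-dataset output folder (`./output/<dataset>`) are assumptions made for illustration.

```python
import shlex
import sys

DATASETS = ["hy_mix", "hy_pure", "hy_strict", "icsd_mix", "icsd_pure"]


def build_train_command(dataset, data_root="./MT_Dataset", model_root="./MT_model", output_root="./output"):
    """Build the MT-GPT training command for one dataset as an argument list."""
    return [
        sys.executable,  # the interpreter of the active environment
        f"{model_root}/MT_GPT/train_GPT.py",
        "--tokenizer", f"{model_root}/tokenizer/",
        "--train_data", f"{data_root}/{dataset}/train.txt",
        "--valid_data", f"{data_root}/{dataset}/valid.txt",
        # Assumption: one output folder per dataset so runs do not overwrite each other.
        "--output_dir", f"{output_root}/{dataset}",
    ]


if __name__ == "__main__":
    for name in DATASETS:
        print(shlex.join(build_train_command(name)))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to launch the run instead of printing it.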

How to generate new materials compositions/formulas using the trained models

Download the models from the above link or use your own trained models, then put them into the corresponding folders.

Generate materials formulas using the trained MT-GPT model.

python generateFormula_random.py  --tokenizer ./MT_model/tokenizer  --model_name OpenAIGPTLMHeadModel  --model_path ./MT_model/MT_GPT/hy_mix

Multi-threaded generation is also provided. The default number of threads is 10; change it with the --n_thread argument.

python multi_generateFormula_random.py  --tokenizer ./tokenizer  --model_name GPT2LMHeadModel  --model_path ./MT_GPT2/hy_mix  --n_thread 5
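
The two generation scripts take the same core flags, so a small wrapper can choose between them. This is a hedged sketch rather than repository code: `build_generate_command` is a hypothetical helper, and it assumes the multi-thread script differs from the single-thread one only by the extra `--n_thread` flag.

```python
import shlex


def build_generate_command(model_name, model_path, tokenizer, n_thread=None):
    """Return the generation command; use the multi-thread script when n_thread is given."""
    script = "generateFormula_random.py" if n_thread is None else "multi_generateFormula_random.py"
    cmd = [
        "python", script,
        "--tokenizer", tokenizer,
        "--model_name", model_name,
        "--model_path", model_path,
    ]
    if n_thread is not None:
        if n_thread < 1:
            raise ValueError("n_thread must be a positive integer")
        cmd += ["--n_thread", str(n_thread)]
    return cmd


if __name__ == "__main__":
    command = build_generate_command("GPT2LMHeadModel", "./MT_GPT2/hy_mix", "./tokenizer", n_thread=5)
    print(shlex.join(command))
```

Omitting `n_thread` yields the single-thread command. The `--model_name` value should name the Hugging Face class that matches the checkpoint in `--model_path`, as in the examples above (`OpenAIGPTLMHeadModel` for MT-GPT, `GPT2LMHeadModel` for MT-GPT2).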