# MTransformer: Materials Transformers
**Citation:** Fu, Nihang, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, and Jianjun Hu. "Material transformers: deep learning language models for generative materials design." *Machine Learning: Science and Technology* 4, no. 1 (2023): 015001. PDF
by <a href="http://mleg.cse.sc.edu" target="_blank">Machine Learning and Evolution Laboratory</a>, University of South Carolina
## Benchmark Datasets for training inorganic materials composition transformers
- ICSD-mix dataset (52,317 samples)
- ICSD-pure dataset (39,431 samples)
- Hybrid-mix dataset (418,983 samples)
- Hybrid-pure dataset (257,138 samples)
- Hybrid-strict dataset (212,778 samples)
All of the above datasets can be downloaded from Figshare.
## Trained Materials Transformer Models
| Model | ICSD-mix | ICSD-pure | Hybrid-mix | Hybrid-pure | Hybrid-strict |
|---|---|---|---|---|---|
| MT-GPT | GPT-Im | GPT-Ip | GPT-Hm | GPT-Hp | GPT-Hs |
| MT-GPT2 | GPT2-Im | GPT2-Ip | GPT2-Hm | GPT2-Hp | GPT2-Hs |
| MT-GPTJ | GPTJ-Im | GPTJ-Ip | GPTJ-Hm | GPTJ-Hp | GPTJ-Hs |
| MT-GPTNeo | GPTNeo-Im | GPTNeo-Ip | GPTNeo-Hm | GPTNeo-Hp | GPTNeo-Hs |
| MT-BART | BART-Im | BART-Ip | BART-Hm | BART-Hp | BART-Hs |
| MT-RoBERTa | RoBERTa-Im | RoBERTa-Ip | RoBERTa-Hm | RoBERTa-Hp | RoBERTa-Hs |
## How to train with your own dataset
### Installation
- Create your own conda or other virtual environment.
- Install the basic packages:

  ```
  pip install -r requirements.txt
  ```

- Install [PyTorch](https://pytorch.org/get-started/locally/) from the PyTorch website, choosing the build that matches your Python and CUDA versions (see the setup sketch after this list).
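For example, a typical setup might look like the following. The environment name, Python version, and CUDA build are illustrative assumptions; copy the exact PyTorch command that pytorch.org generates for your machine.

```
# All names and versions below are examples, not requirements of this repo.
conda create -n mtransformer python=3.9   # "mtransformer" is an arbitrary env name
conda activate mtransformer
pip install -r requirements.txt
# Example PyTorch install for a CUDA 11.8 build; get the exact command from pytorch.org
pip install torch --index-url https://download.pytorch.org/whl/cu118
```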
### Data preparation
Download the datasets from the link above, then unzip them into the `MT_dataset` folder.
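A minimal sketch of this step, assuming the Figshare download arrives as a single zip archive; `datasets.zip` is a placeholder, not the archive's actual name:

```
mkdir -p MT_dataset
unzip datasets.zip -d MT_dataset/   # replace datasets.zip with the real archive name
```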
After the above steps, the directory layout should look like this:
```
MTransformer
├── MT_dataset
│   ├── hy_mix
│   │   ├── test.txt
│   │   ├── train.txt
│   │   └── valid.txt
│   ├── hy_pure
│   ├── hy_strict
│   ├── icsd_mix
│   ├── icsd_pure
│   └── mp
├── MT_models
│   ├── MT_Bart
│   │   ├── hy_mix
│   │   │   ├── config.json
│   │   │   ├── pytorch_model.bin
│   │   │   └── training_args.bin
│   │   ├── hy_pure
│   │   ├── hy_strict
│   │   ├── icsd_mix
│   │   └── icsd_pure
│   ├── MT_GPT
│   ├── MT_GPT2
│   ├── MT_GPTJ
│   ├── MT_GPTNeo
│   ├── MT_RoBERTa
│   └── tokenizer
│       └── vocab.txt
├── generateFormula_random.py
├── multi_generateFormula_random.py
├── README.md
└── requirements.txt
```
### Training
The following example trains an MT-GPT model on the Hybrid-mix dataset:

```
python ./MT_models/MT_GPT/train_GPT.py --tokenizer ./MT_models/tokenizer/ --train_data ./MT_dataset/hy_mix/train.txt --valid_data ./MT_dataset/hy_mix/valid.txt --output_dir ./output
```

Training the other models works the same way as for MT-GPT.
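For readers who want to see roughly what such a training script does, here is a minimal, hypothetical sketch built on the Hugging Face `transformers` Trainer API. It is not the repository's actual `train_GPT.py`: the tokenizer class (guessed from the WordPiece-style `vocab.txt`), model size, and hyperparameters are all assumptions.

```python
# Hypothetical sketch of GPT-style training on formula strings
# (NOT the repository's train_GPT.py; sizes and settings are assumptions).
from transformers import (
    BertTokenizerFast,                  # assumes vocab.txt is a WordPiece vocab
    OpenAIGPTConfig, OpenAIGPTLMHeadModel,
    LineByLineTextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# do_lower_case=False keeps element-symbol casing (e.g. "Fe" stays "Fe").
tokenizer = BertTokenizerFast("./MT_models/tokenizer/vocab.txt", do_lower_case=False)

# One formula per line; block_size is an illustrative choice.
train_ds = LineByLineTextDataset(tokenizer=tokenizer,
                                 file_path="./MT_dataset/hy_mix/train.txt",
                                 block_size=64)
valid_ds = LineByLineTextDataset(tokenizer=tokenizer,
                                 file_path="./MT_dataset/hy_mix/valid.txt",
                                 block_size=64)

# mlm=False yields causal (next-token) language-modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# A small GPT; the paper's actual configurations may differ.
config = OpenAIGPTConfig(vocab_size=tokenizer.vocab_size,
                         n_positions=64, n_embd=256, n_layer=4, n_head=4)
model = OpenAIGPTLMHeadModel(config)

args = TrainingArguments(output_dir="./output", num_train_epochs=3,
                         per_device_train_batch_size=64)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_ds, eval_dataset=valid_ds).train()
```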
## How to generate new materials compositions/formulas using the trained models
Download the models from the link above or use your own trained models, then put them into the corresponding folders.
Generate materials formulas using the trained MT-GPT model:

```
python generateFormula_random.py --tokenizer ./MT_models/tokenizer --model_name OpenAIGPTLMHeadModel --model_path ./MT_models/MT_GPT/hy_mix
```
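As a rough illustration of what formula generation involves, the sketch below samples compositions from a trained checkpoint with the Hugging Face `generate` API. It is hypothetical, not the logic of `generateFormula_random.py`; the seed token and sampling settings are assumptions.

```python
# Hypothetical generation sketch (not the repo's generateFormula_random.py).
import torch
from transformers import BertTokenizerFast, OpenAIGPTLMHeadModel

tokenizer = BertTokenizerFast("./MT_models/tokenizer/vocab.txt", do_lower_case=False)
model = OpenAIGPTLMHeadModel.from_pretrained("./MT_models/MT_GPT/hy_mix")
model.eval()

# Seed with the [CLS] token and sample with top-k for diverse formulas.
ids = torch.tensor([[tokenizer.cls_token_id]])
out = model.generate(ids, do_sample=True, top_k=50, max_length=32,
                     num_return_sequences=10,
                     pad_token_id=tokenizer.pad_token_id)
for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```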
We also provide multi-threaded generation. The default number of threads is 10; you can change it with the `--n_thread` argument:

```
python multi_generateFormula_random.py --tokenizer ./MT_models/tokenizer --model_name GPT2LMHeadModel --model_path ./MT_models/MT_GPT2/hy_mix --n_thread 5
```
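A hedged sketch of how multi-threaded sampling can be organized; this is not the repository's `multi_generateFormula_random.py`, and the thread-pool layout and batch sizes are assumptions.

```python
# Hypothetical multi-threaded sampling sketch (not the repo's actual script).
# Threads share one model; PyTorch releases the GIL inside tensor ops,
# so CPU-bound sampling can overlap across threads.
from concurrent.futures import ThreadPoolExecutor
import torch
from transformers import BertTokenizerFast, GPT2LMHeadModel

tokenizer = BertTokenizerFast("./MT_models/tokenizer/vocab.txt", do_lower_case=False)
model = GPT2LMHeadModel.from_pretrained("./MT_models/MT_GPT2/hy_mix")
model.eval()

def sample_batch(n):
    # Draw n formulas, each seeded from the [CLS] token.
    ids = torch.tensor([[tokenizer.cls_token_id]])
    out = model.generate(ids, do_sample=True, top_k=50, max_length=32,
                         num_return_sequences=n,
                         pad_token_id=tokenizer.pad_token_id)
    return [tokenizer.decode(s, skip_special_tokens=True) for s in out]

n_thread = 5  # mirrors the --n_thread flag above
with ThreadPoolExecutor(max_workers=n_thread) as pool:
    for batch in pool.map(sample_batch, [20] * n_thread):
        print("\n".join(batch))
```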