Embrace the world of large language models!

Description

This repository stores the source code and data for the paper 'A Prompt-Engineered Large Language Model, Deep Learning Workflow for Materials Classification' published in Materials Today.
arXiv: 10.48550/arXiv.2401.17788
Materials Today: 10.1016/j.mattod.2024.08.028.

There are five folders here: the metallic glasses database source data folder, the large language model folder, the classification model folder, the model interpretation and visualization folder, and the supplemetary_data_for_revision folder.

How to Cite

@article{LIU2024,
title = {A prompt-engineered large language model, deep learning workflow for materials classification},
journal = {Materials Today},
year = {2024},
issn = {1369-7021},
doi = {https://doi.org/10.1016/j.mattod.2024.08.028},
url = {https://www.sciencedirect.com/science/article/pii/S1369702124002001},
author = {Siyu Liu and Tongqi Wen and A.S.L. Subrahmanyam Pattamatta and David J. Srolovitz},
keywords = {Materials classification, Large language model, Prompt engineering, Deep learning},
abstract = {Large language models (LLMs) have demonstrated rapid progress across a wide array of domains. Owing to the very large number of parameters and training data in LLMs, these models inherently encompass an expansive and comprehensive materials knowledge database, far exceeding the capabilities of individual researcher. Nonetheless, devising methods to harness the knowledge embedded within LLMs for the design and discovery of novel materials remains a formidable challenge. We introduce a general approach for addressing materials classification problems, which incorporates LLMs, prompt engineering, and deep learning. Utilizing a dataset of metallic glasses as a case study, our methodology achieved an improvement of up to 463% in prediction accuracy compared to conventional classification models. These findings underscore the potential of leveraging textual knowledge generated by LLMs for materials especially in the common situation where datasets are sparse, thereby promoting innovation in materials discovery and design.}
}

Follow the steps below to set up the environment and configuration.

Step 1: Configure Python environment and libraries

We recommend running all code in a Python virtual environment.

If you have not installed Python before, we recommend following this guide: Anaconda Installation

To create and activate a new conda environment, use the following command:

conda create --name bmg python=3.10
conda activate bmg

Then install the required Python packages with:

pip install -r requirements.txt
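
After installation, you can quickly check that the core deep learning packages are available. This is only a sanity-check sketch; it assumes requirements.txt includes torch and transformers, which the classification steps below rely on:

# Sanity check: confirm the core packages import correctly
# (assumes torch and transformers are listed in requirements.txt)
import torch
import transformers
print(torch.__version__, transformers.__version__)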

Step 2: Register a Gemini API key from Google

If you also want to generate text data with Gemini, first apply for a free API key from Google Dev.

Then copy and paste the key into the .env file in the llm folder:

GOOGLE_API_KEY='xxxxx'
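
The text-generation scripts in the llm folder are the authoritative reference for how the key is used. As a minimal sketch of how such a key is typically picked up and passed to Gemini, assuming the google-generativeai and python-dotenv packages and the gemini-pro model name (the prompt text is only a placeholder):

# Minimal sketch: load the key from .env and query Gemini
# (assumes google-generativeai and python-dotenv are installed and the script
#  is run from the llm folder so that load_dotenv() finds the .env file)
import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads GOOGLE_API_KEY from the .env file
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("gemini-pro")  # model name is an assumption
response = model.generate_content("Describe the glass-forming ability of a Zr-Cu-Al alloy.")
print(response.text)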

Step 3: Download Hugging Face Pre-trained Models

Our classification models are fine-tuned from pre-trained models, so if you want to reproduce the training process yourself, you first need to obtain the pre-trained model files.

You can load the pre-trained models directly by following the official Hugging Face guide.
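
For example, with the transformers library the MatSciBERT model can be loaded straight from the Hugging Face Hub. This is only a minimal sketch; the repository's own training scripts may load the models differently:

# Minimal sketch: load a pre-trained model directly from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModel.from_pretrained("m3rg-iitd/matscibert")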

If you want to download a pre-trained model from Hugging Face to a local folder, use the following snippet:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="xxx", local_dir="xxx")

Replace repo_id with the name of the model you want to download and local_dir with the folder where you want to store it; a concrete example follows the list below.

repo_id of MatSciBERT: m3rg-iitd/matscibert
repo_id of Longformer: allenai/longformer-base-4096
repo_id of BERT: bert-base-cased
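
For example, to download MatSciBERT (the local_dir path below is arbitrary and only for illustration):

# Example: download the MatSciBERT files into a local folder
from huggingface_hub import snapshot_download
snapshot_download(repo_id="m3rg-iitd/matscibert", local_dir="./pretrained/matscibert")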

If you just want to do inference with MgBERT, you can use the model weights file in the checkpoint folder:

cd classification_models/different_BERT/checkpoint

and load it with the inference_template file in the interpretability_and_visualization folder.
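
The inference_template file is the authoritative reference for running inference. The following is only a minimal sketch of such a run, assuming the checkpoint is saved in the standard transformers sequence-classification format and that tokenizer files are stored alongside the weights; the example input text is a placeholder:

# Minimal inference sketch (assumptions noted above; see inference_template for the actual workflow)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "classification_models/different_BERT/checkpoint"  # MgBERT weights from this repository
tokenizer = AutoTokenizer.from_pretrained(ckpt)           # assumes tokenizer files are saved with the checkpoint
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
model.eval()

text = "Example text describing an alloy composition."    # placeholder input
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)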