MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

This repository contains the code for the paper MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science, accepted to Findings of EMNLP 2024. In this project, we expand material-aware entities and use them to continue pre-training PLMs for the materials science domain.

Requirements

Pre-processing

Prepare the pre-training corpora (e.g., scientific papers) in the raw_data folder. A sampled pre-training corpus (train_sampled.txt) is provided in the raw_data folder.
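The README does not fix the corpus file layout; the snippet below is only a quick sanity check, assuming the sampled file is plain UTF-8 text with one passage per line (the file name is the one shipped in raw_data; everything else is illustrative).

```python
# Peek at the sampled corpus before preprocessing.
# Assumption: plain UTF-8 text, one passage per line.
from pathlib import Path

corpus_path = Path("raw_data/train_sampled.txt")
with corpus_path.open(encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(f"[{i}] {line.strip()[:120]}")
        if i >= 4:  # show only the first five passages
            break
```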

  1. Run bash scripts/preprocess.sh to normalize the raw sentences and split them to the maximum length.
  2. Run bash scripts/find_entities.sh to extract the positions of material-aware entities in the pre-processed sentences (see the sketch below).
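The actual extraction is handled by scripts/find_entities.sh. The sketch below only illustrates the general idea of recording character offsets of material-aware entities per sentence; the entity vocabulary, file paths, and simple string matching are placeholders, not the repository's method.

```python
import json
import re

# Hypothetical entity vocabulary; the real pipeline derives material-aware
# entities from domain resources rather than a hand-written list.
ENTITY_VOCAB = ["LiFePO4", "perovskite", "solid oxide fuel cell"]

def find_entity_spans(sentence: str) -> list[dict]:
    """Return character offsets of every vocabulary entity found in the sentence."""
    spans = []
    for entity in ENTITY_VOCAB:
        for match in re.finditer(re.escape(entity), sentence):
            spans.append({"entity": entity, "start": match.start(), "end": match.end()})
    return spans

if __name__ == "__main__":
    # Hypothetical input/output paths for the pre-processed sentences.
    with open("data/sentences.txt", encoding="utf-8") as src, \
         open("data/entity_positions.jsonl", "w", encoding="utf-8") as dst:
        for sentence in src:
            sentence = sentence.strip()
            record = {"sentence": sentence, "entities": find_entity_spans(sentence)}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```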

Pre-training

To continually pre-train the PLMs with distillation, run bash scripts/pretrain.sh.
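scripts/pretrain.sh wraps the full MELT recipe (entity-aware objectives and distillation). As a rough orientation only, a vanilla continued pre-training loop with Hugging Face Transformers looks like the sketch below; the checkpoint name, file paths, and hyperparameters are placeholders, and the MELT-specific entity handling is omitted.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder starting checkpoint (any science-domain PLM would be analogous).
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Assumes the pre-processed sentences live in a plain-text file, one per line.
dataset = load_dataset("text", data_files={"train": "data/sentences.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

# Plain masked-language-modeling collator; MELT's entity-aware masking is not shown.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="checkpoints/melt-cpt",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```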

Fine-tuning

Run the following scripts with the pre-trained weights, passed via the --load_weight argument (a loading sketch follows the list):

  1. MatSciNLP: bash scripts/run_matscinlp.sh

  2. NER (SOFC-NER, SOFC-Filling, MatScholar): bash scripts/run_ner.sh

  3. Classification (Glass Science): bash scripts/run_cls.sh
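What --load_weight does internally is defined by the fine-tuning scripts; a common pattern, sketched below with hypothetical paths and label count, is to load the continued-pre-trained encoder weights into the downstream model before training.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical values; mirror whatever you pass via --load_weight.
base_model = "allenai/scibert_scivocab_uncased"
cpt_weights = "checkpoints/melt-cpt/pytorch_model.bin"

model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Load the continued-pre-trained encoder; strict=False tolerates the missing
# task head and the MLM head left over from pre-training.
state_dict = torch.load(cpt_weights, map_location="cpu")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```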

Contact Info

For help or issues using MELT, please submit a GitHub issue.

For personal communication related to MELT, please contact Junho Kim <monocrat@korea.ac.kr>.