HanBert-Transformers

HanBert on 🤗 Huggingface Transformers 🤗

Details

HanBert's original TensorFlow checkpoint can be converted to PyTorch with the transformers CLI:

# transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT
$ transformers bert HanBert-54kN/model.ckpt-3000000 \
                    HanBert-54kN/bert_config.json \
                    HanBert-54kN/pytorch_model.bin
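
The same conversion can also be done from Python. The following is a sketch modeled on the logic of transformers' convert_bert_original_tf_checkpoint_to_pytorch script; TensorFlow must be installed so the checkpoint can be read:

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Build a randomly initialized BERT, then overwrite its weights
# with the values stored in the TensorFlow checkpoint.
config = BertConfig.from_json_file('HanBert-54kN/bert_config.json')
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, 'HanBert-54kN/model.ckpt-3000000')
torch.save(model.state_dict(), 'HanBert-54kN/pytorch_model.bin')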

How to Use

  1. Install the required libraries (an example pip command follows the directory layout below):

    • torch>=1.1.0
    • transformers>=2.2.2
  2. Download the model and unzip it.

    • The original HanBert required copying the tokenization-related files into /usr/local/moran, but with this folder they can be used as-is.
    • Download link (Pretrained weight + Tokenizer tool)
  3. Prepare tokenization_hanbert.py.

    • The tokenizer only works in an Ubuntu environment.
    • The directory must be laid out as shown below:
.
├── ...
├── HanBert-54kN-torch
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── vocab_54k.txt
│   ├── libmoran4dnlp.so
│   ├── moran.db
│   ├── udict.txt
│   └── uentity.txt
├── tokenization_hanbert.py
└── ...
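
As a sketch of step 1 above (assuming pip as the package manager):

$ pip install "torch>=1.1.0" "transformers>=2.2.2"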

Example

1. Model

>>> import torch
>>> from transformers import BertModel

>>> model = BertModel.from_pretrained('HanBert-54kN-torch')
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 0], [0, 0, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output
tensor([[[-0.0938, -0.5030,  0.3765,  ..., -0.4880, -0.4486,  0.3600],
         [-0.6036, -0.1008, -0.2344,  ..., -0.6606, -0.5762,  0.1021],
         [-0.4353,  0.0970, -0.0781,  ..., -0.7686, -0.4418,  0.4109]],

        [[-0.7117,  0.2479, -0.8164,  ...,  0.1509,  0.8337,  0.4054],
         [-0.7867, -0.0443, -0.8754,  ...,  0.0952,  0.5044,  0.5125],
         [-0.8613,  0.0138, -0.9315,  ...,  0.1651,  0.6647,  0.5509]]],
       grad_fn=<AddcmulBackward>)
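
The tuple unpacking above matches the transformers 2.x API that this repo pins. On transformers 4.x (an assumption, not covered here), the forward pass returns a ModelOutput object instead, so the equivalent access would be:

>>> # transformers 4.x style (assumption; this repo itself pins >=2.2.2)
>>> outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
>>> sequence_output = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output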

2. Tokenizer

>>> from tokenization_hanbert import HanBertTokenizer
>>> tokenizer = HanBertTokenizer.from_pretrained('HanBert-54kN-torch')
>>> text = "나는 걸어가고 있는 중입니다. 나는걸어 가고있는 중입니다. 잘 분류되기도 한다. 잘 먹기도 한다."
>>> tokenizer.tokenize(text)
['나', '~~는', '걸어가', '~~고', '있', '~~는', '중', '~~입', '~~니다', '.', '나', '##는걸', '##어', '가', '~~고', '~있', '~~는', '중', '~~입', '~~니다', '.', '잘', '분류', '~~되', '~~기', '~~도', '한', '~~다', '.', '잘', '먹', '~~기', '~~도', '한', '~~다', '.']
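
HanBertTokenizer inherits from transformers' PreTrainedTokenizer, so the standard encode() round trip should work. A minimal sketch tying it to the model from section 1 (a single sentence, so there is no padding and the attention mask is all ones):

>>> import torch
>>> ids = tokenizer.encode(text, add_special_tokens=True)  # adds [CLS]/[SEP]
>>> input_ids = torch.LongTensor([ids])          # batch of one sentence
>>> attention_mask = torch.ones_like(input_ids)  # nothing to mask out
>>> sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)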

3. Test with a Python script

$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-torch
$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-IP-torch

Results on Subtasks

All results were obtained with max_seq_len = 50.

                    NSMC (acc)   Naver-NER (F1)
HanBert-54kN        90.16        87.31
HanBert-54kN-IP     88.72        86.57
KoBERT              89.63        86.11
Bert-multilingual   87.07        84.20
