
ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

This repository contains the code, models and datasets for ChineseBERT, presented at ACL 2021.

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li

Guide

| Section | Description |
| ------- | ----------- |
| Introduction | Introduction to ChineseBERT |
| Download | Download links for ChineseBERT |
| Quick tour | Learn how to quickly load models |
| Experiment | Experiment results on different Chinese NLP datasets |
| Citation | Citation |
| Contact | How to contact us |

Introduction

We propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining.

First, for each Chinese character, we obtain three kinds of embeddings: the char embedding, the glyph embedding and the pinyin embedding.

Then, the char embedding, glyph embedding and pinyin embedding are concatenated and mapped to a D-dimensional fusion embedding through a fully connected layer.
Finally, the fusion embedding is added to the position embedding and fed as input to the BERT model.
The following image shows the overall architecture of the ChineseBERT model.

[Figure: overview of the ChineseBERT model architecture]

ChineseBERT leverages the glyph and pinyin information of Chinese characters to enhance the model's ability to capture contextual semantics from surface character forms and to disambiguate polyphonic characters in Chinese.
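
To make the fusion step described above concrete, here is a minimal PyTorch sketch of how the three embeddings could be combined; the module name, embedding sizes and the use of a single `nn.Linear` for the fully connected layer are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class FusionEmbedding(nn.Module):
    """Sketch of the fusion step: concatenate the char, glyph and pinyin
    embeddings, project them to D dimensions with a fully connected layer,
    then add the position embedding."""

    def __init__(self, d_char, d_glyph, d_pinyin, d_model, max_len=512):
        super().__init__()
        # Fully connected layer mapping the concatenation to the D-dimensional fusion embedding
        self.fc = nn.Linear(d_char + d_glyph + d_pinyin, d_model)
        self.position_embedding = nn.Embedding(max_len, d_model)

    def forward(self, char_emb, glyph_emb, pinyin_emb):
        # char_emb, glyph_emb, pinyin_emb: (batch, seq_len, d_*)
        fused = self.fc(torch.cat([char_emb, glyph_emb, pinyin_emb], dim=-1))
        positions = torch.arange(fused.size(1), device=fused.device)
        # The fusion embedding plus the position embedding is the input to the BERT encoder
        return fused + self.position_embedding(positions)
```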

Download

We provide the pre-trained ChineseBERT models in PyTorch, following the Hugging Face model format.

Our models can be downloaded here:

| Model | Model Hub | Google Drive |
| ----- | --------- | ------------ |
| ChineseBERT-base | 564M | 560M |
| ChineseBERT-large | 1.4G | 1.4G |

Note: The model hub package contains the model weights as well as the font and pinyin config files.
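
If you prefer to fetch a checkpoint programmatically, the sketch below uses `huggingface_hub.snapshot_download`; the repository id `ShannonAI/ChineseBERT-base` is an assumption about the model hub naming and should be verified against the actual download links above.

```python
from huggingface_hub import snapshot_download

# Assumed model hub repository id; verify it against the actual download link.
CHINESEBERT_PATH = snapshot_download(repo_id="ShannonAI/ChineseBERT-base")

# The downloaded directory should contain the model weights plus the font
# and pinyin config files mentioned in the note above.
print(CHINESEBERT_PATH)
```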

Quick tour

We trained our model with Huggingface, so it can be loaded in just a few lines of code.
Download the ChineseBERT model and save it at [CHINESEBERT_PATH].
Here is a quick tour of loading our model.

>>> from models.modeling_glycebert import GlyceBertForMaskedLM

>>> chinese_bert = GlyceBertForMaskedLM.from_pretrained([CHINESEBERT_PATH])
>>> print(chinese_bert)

The complete example can be found here: Masked word completion with ChineseBERT

Another example shows how to get the representation of a sentence:

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertModel

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])
>>> sentence = '我喜欢猫'  # "I like cats"

>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> length = input_ids.shape[0]
>>> input_ids = input_ids.view(1, length)  # add the batch dimension
>>> pinyin_ids = pinyin_ids.view(1, length, 8)  # each character maps to a pinyin sequence of length 8
>>> output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]
>>> print(output_hidden)
tensor([[[ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519],
         [ 0.0144, -0.2494, -0.1853,  ...,  0.0673,  0.0424, -0.1074],
         [ 0.0839, -0.2989, -0.2421,  ...,  0.0454, -0.1474, -0.1736],
         [-0.0499, -0.2983, -0.1604,  ..., -0.0550, -0.1863,  0.0226],
         [ 0.1428, -0.0682, -0.1310,  ..., -0.1126,  0.0440, -0.1782],
         [ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519]]],
       grad_fn=<NativeLayerNormBackward>)

The complete code can be found HERE
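
If you need a single vector per sentence rather than per-token hidden states, one common option is to mean-pool the output above; the snippet below is a small sketch of that step, not a utility provided by the repository.

```python
import torch.nn.functional as F

# output_hidden has shape (1, length, hidden_size), as printed above.
# Mean-pool over the token dimension to get one vector per sentence.
sentence_embedding = output_hidden.mean(dim=1)                # (1, hidden_size)
sentence_embedding = F.normalize(sentence_embedding, dim=-1)  # optional L2 normalization
print(sentence_embedding.shape)
```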

Experiments

ChnSentiCorp

ChnSentiCorp is a dataset for sentiment analysis.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 95.4 | 95.5 |
| BERT | 95.1 | 95.4 |
| BERT-wwm | 95.4 | 95.3 |
| RoBERTa | 95.0 | 95.6 |
| MacBERT | 95.2 | 95.6 |
| ChineseBERT | 95.6 | 95.7 |
| ---- | ---- | ---- |
| RoBERTa-large | 95.8 | 95.8 |
| MacBERT-large | 95.7 | 95.9 |
| ChineseBERT-large | 95.8 | 95.9 |

Training details and code can be found HERE
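
As a rough illustration of how such a sentence-classification task can be set up on top of the encoder from the quick tour, here is a minimal sketch that adds a linear head on the [CLS] representation of GlyceBertModel; the head, the assumption that the encoder exposes `config.hidden_size`, and the commented usage lines are illustrative and do not reproduce the linked training code.

```python
import torch
import torch.nn as nn
from models.modeling_glycebert import GlyceBertModel

class SentenceClassifier(nn.Module):
    """Sketch: ChineseBERT encoder + linear head for sentence classification."""

    def __init__(self, pretrained_path, num_labels=2):
        super().__init__()
        self.encoder = GlyceBertModel.from_pretrained(pretrained_path)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, pinyin_ids):
        hidden = self.encoder(input_ids, pinyin_ids)[0]  # (batch, seq_len, hidden_size)
        return self.head(hidden[:, 0])                   # classify the [CLS] position

# Usage sketch:
# model = SentenceClassifier("[CHINESEBERT_PATH]")
# logits = model(input_ids, pinyin_ids)
# loss = nn.functional.cross_entropy(logits, labels)
```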

THUCNews

THUCNews contains news in 10 categories.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 95.4 | 95.5 |
| BERT | 95.1 | 95.4 |
| BERT-wwm | 95.4 | 95.3 |
| RoBERTa | 95.0 | 95.6 |
| MacBERT | 95.2 | 95.6 |
| ChineseBERT | 95.6 | 95.7 |
| ---- | ---- | ---- |
| RoBERTa-large | 95.8 | 95.8 |
| MacBERT-large | 95.7 | 95.9 |
| ChineseBERT-large | 95.8 | 95.9 |

Training details and code can be found HERE

XNLI

XNLI is a dataset for natural language inference.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 79.7 | 78.6 |
| BERT | 79.0 | 78.2 |
| BERT-wwm | 79.4 | 78.7 |
| RoBERTa | 80.0 | 78.8 |
| MacBERT | 80.3 | 79.3 |
| ChineseBERT | 80.5 | 79.6 |
| ---- | ---- | ---- |
| RoBERTa-large | 82.1 | 81.2 |
| MacBERT-large | 82.4 | 81.3 |
| ChineseBERT-large | 82.7 | 81.6 |

Training details and code can be found HERE

BQ

BQ Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 86.3 | 85.0 |
| BERT | 86.1 | 85.2 |
| BERT-wwm | 86.4 | 85.3 |
| RoBERTa | 86.0 | 85.0 |
| MacBERT | 86.0 | 85.2 |
| ChineseBERT | 86.4 | 85.2 |
| ---- | ---- | ---- |
| RoBERTa-large | 86.3 | 85.8 |
| MacBERT-large | 86.2 | 85.6 |
| ChineseBERT-large | 86.5 | 86.0 |

Training details and code can be found HERE

LCQMC

LCQMC Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 89.8 | 87.2 |
| BERT | 89.4 | 87.0 |
| BERT-wwm | 89.6 | 87.1 |
| RoBERTa | 89.0 | 86.4 |
| MacBERT | 89.5 | 87.0 |
| ChineseBERT | 89.8 | 87.4 |
| ---- | ---- | ---- |
| RoBERTa-large | 90.4 | 87.0 |
| MacBERT-large | 90.6 | 87.6 |
| ChineseBERT-large | 90.5 | 87.8 |

Training details and code can be found HERE

TNEWS

TNEWS is a 15-class short news text classification dataset.
Evaluation Metrics: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 58.24 | 58.33 |
| BERT | 56.09 | 56.58 |
| BERT-wwm | 56.77 | 56.86 |
| RoBERTa | 57.51 | 56.94 |
| ChineseBERT | 58.64 | 58.95 |
| ---- | ---- | ---- |
| RoBERTa-large | 58.32 | 58.61 |
| ChineseBERT-large | 59.06 | 59.47 |

Training details and code can be found HERE

CMRC

CMRC is a machine reading comprehension dataset.
Evaluation Metrics: EM

| Model | Dev | Test |
| ----- | --- | ---- |
| ERNIE | 66.89 | 74.70 |
| BERT | 66.77 | 71.60 |
| BERT-wwm | 66.96 | 73.95 |
| RoBERTa | 67.89 | 75.20 |
| MacBERT | - | - |
| ChineseBERT | 67.95 | 95.7 |
| ---- | ---- | ---- |
| RoBERTa-large | 70.59 | 77.95 |
| ChineseBERT-large | 70.70 | 78.05 |

Training details and code can be found HERE
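
For reference, the EM (exact match) metric used above scores a prediction as 1 if it exactly matches one of the gold answers and 0 otherwise; the sketch below is one simple way to compute it, with whitespace/punctuation normalization added as an illustrative assumption.

```python
import re

def exact_match(prediction: str, gold_answers: list) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    def normalize(text: str) -> str:
        # Strip whitespace and common punctuation; this normalization is illustrative.
        return re.sub(r"[\s\.,;:!?，。；：！？]", "", text)
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

# Corpus-level EM is the average over all questions:
# em = 100.0 * sum(exact_match(p, g) for p, g in zip(predictions, gold)) / len(predictions)
```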

OntoNotes

OntoNotes 4.0 is a Chinese named entity recognition dataset and contains 18 named entity types.

Evaluation Metrics: Span-Level F1

| Model | Test Precision | Test Recall | Test F1 |
| ----- | -------------- | ----------- | ------- |
| BERT | 79.69 | 82.09 | 80.87 |
| RoBERTa | 80.43 | 80.30 | 80.37 |
| ChineseBERT | 80.03 | 83.33 | 81.65 |
| ---- | ---- | ---- | ---- |
| RoBERTa-large | 80.72 | 82.07 | 81.39 |
| ChineseBERT-large | 80.77 | 83.65 | 82.18 |

To reproduce the experiment results, please install torch 1.7.1+cu101 via `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`.
Training details and code can be found HERE
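
The span-level F1 used for NER counts an entity as correct only when both its boundaries and its type match a gold entity; the sketch below computes it from per-sentence sets of (start, end, type) tuples and is an illustration rather than the repository's evaluation code.

```python
def span_f1(pred_spans, gold_spans):
    """Span-level precision, recall and F1.

    pred_spans, gold_spans: lists of sets of (start, end, entity_type) tuples,
    one set per sentence; a span counts as correct only on an exact match."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_spans, gold_spans):
        tp += len(pred & gold)   # spans predicted and present in the gold annotation
        fp += len(pred - gold)   # predicted spans with no gold counterpart
        fn += len(gold - pred)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example:
# p, r, f = span_f1([{(0, 2, "PER")}], [{(0, 2, "PER"), (4, 6, "LOC")}])
```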

Weibo

Weibo is a Chinese named entity recognition dataset and contains 4 named entity types.

Evaluation Metrics: Span-Level F1

| Model | Test Precision | Test Recall | Test F1 |
| ----- | -------------- | ----------- | ------- |
| BERT | 67.12 | 66.88 | 67.33 |
| RoBERTa | 68.49 | 67.81 | 68.15 |
| ChineseBERT | 68.27 | 69.78 | 69.02 |
| ---- | ---- | ---- | ---- |
| RoBERTa-large | 66.74 | 70.02 | 68.35 |
| ChineseBERT-large | 68.75 | 72.97 | 70.80 |

To reproduce the experiment results, please install torch 1.7.1+cu101 via `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`.
Training details and code can be found HERE

Citation

@article{sun2021chinesebert,
  title={ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information},
  author={Sun, Zijun and Li, Xiaoya and Sun, Xiaofei and Meng, Yuxian and Ao, Xiang and He, Qing and Wu, Fei and Li, Jiwei},
  journal={arXiv preprint arXiv:2106.16038},
  year={2021}
}

Contact

If you have any questions about our paper, code, model, or data,
please feel free to reach out via GitHub issues or email.
You can send emails to zijun_sun@shannonai.com or xiaoya_li@shannonai.com.