BERT-CCPoem

Introduction

BERT-CCPoem is a BERT-based pre-trained model built specifically for Chinese classical poetry, developed by the Research Center for Natural Language Processing, Computational Humanities and Social Sciences, Tsinghua University (清华大学人工智能研究院自然语言处理与社会人文计算研究中心).

BERT-CCPoem is trained on an (almost) complete collection of Chinese classical poems, CCPC-Full v1.0, consisting of 926,024 classical poems with 8,933,162 sentences. It can provide the vector (embedding) representation of any sentence in any Chinese classical poem, and can therefore be used in various downstream applications, including intelligent poetry retrieval, recommendation, and sentiment analysis.

A typical application is to use the vector representations derived from BERT-CCPoem to find the sentences most semantically similar to a given sentence, ranked by cosine similarity. For example, given the poem sentence "一行白鹭上青天", the top 10 most similar sentences returned by BERT-CCPoem are as follows:

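| Rank | Poem sentence | Cosine similarity | Rank | Poem sentence | Cosine similarity |
|------|---------------|-------------------|------|---------------|-------------------|
| 1 | 白鹭一行登碧霄 | 0.9331 | 6 | 一行白鸟掠清波 | 0.9024 |
| 2 | 一片青天白鹭前 | 0.9185 | 7 | 时向青空飞白鹭 | 0.9023 |
| 3 | 飞却青天白鹭鸶 | 0.9155 | 8 | 一行飞鸟来青天 | 0.9005 |
| 4 | 一双白鹭上云飞 | 0.9118 | 9 | 一行白鹭下汀洲 | 0.8994 |
| 5 | 白鹭一行飞绿野 | 0.9065 | 10 | 一行飞鹭下汀洲 | 0.8962 |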
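
The ranking step behind a table like this can be sketched in a few lines. This is a minimal illustration, not the project's own retrieval code: the function names `cosine_similarity` and `top_k_similar` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real sentence embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(query_vec, candidates, k=10):
    """Rank (sentence, vector) pairs by cosine similarity to query_vec."""
    scored = [(sent, cosine_similarity(query_vec, vec)) for sent, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only; they are not model outputs.
query = [1.0, 0.0, 1.0]
corpus = [
    ("sentence_a", [1.0, 0.1, 0.9]),
    ("sentence_b", [0.0, 1.0, 0.0]),
    ("sentence_c", [0.5, 0.5, 0.5]),
]
ranking = top_k_similar(query, corpus, k=2)
```

In practice the candidate vectors would be computed once for the whole corpus and stored, so that only the query sentence needs to be embedded at search time.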

For comparison, the following are the top 10 most similar sentences returned by a string matching algorithm:

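| Rank | Poem sentence | Rank | Poem sentence |
|------|---------------|------|---------------|
| 1 | 数行白鹭横青湖 | 6 | 一行白鹭渺秋烟 |
| 2 | 一片青天白鹭前 | 7 | 一行白鹭引舟行 |
| 3 | 一行飞鸟来青天 | 8 | 一行白鹭过前山 |
| 4 | 一行白鹭下汀洲 | 9 | 一行白雁遥天暮 |
| 5 | 一行白鹭云间绕 | 10 | 一行白雁天边字 |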
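
This page does not say which string matching algorithm produced the comparison. As one possible stand-in, the sketch below scores sentences with `difflib.SequenceMatcher` from the Python standard library; that choice, and the helper names `string_match_score` and `top_k_by_string_match`, are assumptions made for illustration, so the scores will not necessarily reproduce the ranking shown here.

```python
from difflib import SequenceMatcher

def string_match_score(a, b):
    """Surface similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()

def top_k_by_string_match(query, sentences, k=10):
    """Rank sentences by surface similarity to the query."""
    scored = [(s, string_match_score(query, s)) for s in sentences]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

query = "一行白鹭上青天"
sentences = ["一行白鹭下汀洲", "白鹭一行登碧霄", "一片青天白鹭前"]
ranking = top_k_by_string_match(query, sentences, k=3)
```

Under this particular matcher, a sentence that shares a long prefix with the query ranks first, while 白鹭一行登碧霄, which reorders the same words, ranks last of the three. That is the kind of gap between surface overlap and meaning that an embedding-based ranking is meant to close.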

Model details

We use the "BertModel" class from the open-source project Transformers to train our model. BERT-CCPoem is trained entirely on CCPC-Full v1.0 and takes the Chinese character as its basic unit. Characters with a frequency of less than 3 are treated as [UNK], resulting in a vocabulary of 11,809 character types.
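
The frequency cutoff described above can be sketched as follows. This is a hedged illustration rather than the script used to build the released vocabulary: `build_vocab` and `tokenize` are hypothetical helpers, the list of special tokens is the usual BERT set and is assumed here, and whether the 11,809 figure counts those special tokens is not stated on this page.

```python
from collections import Counter

def build_vocab(sentences, min_freq=3,
                specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Keep characters seen at least min_freq times; special tokens come first."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    kept = sorted(ch for ch, n in counts.items() if n >= min_freq)
    return list(specials) + kept

def tokenize(sentence, vocab):
    """Split into characters, mapping out-of-vocabulary characters to [UNK]."""
    known = set(vocab)
    return [ch if ch in known else "[UNK]" for ch in sentence]

# Tiny made-up corpus: 白 occurs 4 times, 鹭 3 times, 青 and 天 twice, 云 once.
corpus = ["白鹭白鹭", "白鹭青天", "青天白云"]
vocab = build_vocab(corpus)
tokens = tokenize("白云青天", vocab)
```

With this toy corpus only 白 and 鹭 reach the threshold, so every other character in the tokenized sentence becomes [UNK].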

The parameters of BERT-CCPoem are listed as follows:

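| model | version | parameters | vocab_size | model_size | download_url |
|-------|---------|------------|------------|------------|--------------|
| BERT-CCPoem | v1.0 | 8-layer, 512-hidden, 8-heads | 11809 | 162MB | download |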

How to use

Download and unzip the model:

wget https://thunlp.oss-cn-qingdao.aliyuncs.com/BERT_CCPoem_v1.zip
unzip BERT_CCPoem_v1.zip

Then load it in Python:

from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('./BERT_CCPoem_v1')
model = BertModel.from_pretrained('./BERT_CCPoem_v1')
input_ids = torch.tensor(tokenizer.encode("一行白鹭上青天")).unsqueeze(0)
with torch.no_grad():  # inference only, no gradients needed
    # Index 0 is the last hidden state; this works for tuple and ModelOutput returns
    outputs = model(input_ids)[0]
sen_emb = torch.mean(outputs, 1)[0]  # This is the vector representation of "一行白鹭上青天"

Note: You may also check out the sample program gen_vec_rep.py that we provide.
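
To make the pooling step concrete, the sketch below shows in plain Python what `torch.mean(outputs, 1)[0]` computes for a single sentence: the element-wise average of the per-token vectors. The helper name `mean_pool` and the numbers are invented for illustration. Note that `tokenizer.encode` adds the [CLS] and [SEP] tokens by default, so their vectors are part of the average in the snippet above.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

# Toy hidden states: three tokens with two dimensions each (made-up numbers).
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = mean_pool(hidden_states)
```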

Requirement.txt

torch>=1.2.0
transformers==4.3.3

Acknowledging and Citing BERT-CCPoem

We make BERT-CCPoem available for research free of charge, provided that proper reference is made using an appropriate citation.

When writing a paper or producing a software application, tool, or interface based on BERT-CCPoem, please acknowledge its use with a statement such as “We use BERT-CCPoem, a pre-trained model for Chinese classical poetry, developed by Research Center for Natural Language Processing, Computational Humanities and Social Sciences, Tsinghua University, to ……” and cite the GitHub website "https://github.com/THUNLP-AIPoet/BERT-CCPoem".

Contributors

Professor: Maosong Sun(孙茂松)

Students: Zhipeng Guo(郭志芃), Jinyi Hu(胡锦毅)

Contact Us

If you have any questions, suggestions or bug reports, please feel free to email hujy369@gmail.com or gzp9595@gmail.com.