
Introduction

With this repository, you will be able to train a multi-label classification model with BERT

and deploy BERT for online prediction.

You can also find a short tutorial on how to use BERT with Chinese: <a href='https://github.com/brightmart/sentiment_analysis_fine_grain/blob/master/README_bert_chinese_tutorial.md'>BERT short chinese tutorial</a>

You can find an introduction to the task here: <a href='https://challenger.ai/competition/fsauor2018'>fine grain sentiment from AI Challenger</a>

Basic Ideas

Add something here.

Experiment on New Models

For more, see model/bert_cnn_fine_grain_model.py

Performance

Model	TextCNN(No-pretrain)	TextCNN(Pretrain-Finetuning)	Bert(base_model_zh)	Bert(base_model_zh,pre-train on corpus)
F1 Score	0.678	0.685	ADD A NUMBER HERE	ADD A NUMBER HERE

Note: F1 scores are reported on the validation set.
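The table above does not say which F1 averaging is used. As a rough illustration only, the sketch below computes a macro-averaged F1 (the unweighted mean of per-class F1 scores) in plain Python. The function names `f1_for_class` and `macro_f1` are hypothetical and are not taken from this repository.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    # Toy sentiment values for one aspect (assumed value set: 1, 0, -2).
    print(macro_f1([1, 1, 0, -2], [1, 0, 0, -2]))
```

For a multi-aspect task, one plausible approach is to compute this score per aspect and average the results, but that is an assumption rather than something stated in this README.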

Usage

BERT for Multi-label Classification [<a href='https://pan.baidu.com/s/1ZS4dAdOIAe3DaHiwCDrLKw'>data for fine-tuning and pre-train</a>]

export BERT_BASE_DIR=BERT_BASE_DIR/chinese_L-12_H-768_A-12
export TEXT_DIR=TEXT_DIR
nohup python run_classifier_multi_labels_bert.py \
  --task_name=sentiment_analysis \
  --do_train=true \
  --do_eval=true \
  --data_dir=$TEXT_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --output_dir=./checkpoint_bert &

1. First, download the pre-trained model from Google and put it in a folder (e.g. BERT_BASE_DIR):

chinese_L-12_H-768_A-12 from <a href='https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip'>bert</a>

2. Next, prepare training data (e.g. train.tsv) and validation data (e.g. dev.tsv) and put them in a folder (e.g. TEXT_DIR). You can also download the data from here: <a href='https://pan.baidu.com/s/1ZS4dAdOIAe3DaHiwCDrLKw'>data to train bert for AI challenger-Sentiment Analysis</a>.

 The download contains processed data that you can use for both fine-tuning on sentiment analysis and pre-training with BERT.

 It was generated by following the notebook preprocess_char.ipynb step by step.

 You can also generate the data yourself, as long as the data format is compatible with the processor SentimentAnalysisFineGrainProcessor (alias: sentiment_analysis).



 Data format:  label1,label2,label3\t here is sentence or sentences\t
 
 Each line contains only two columns: the first is the target (one or more labels), the second is the input string.
  
 The text does not need to be tokenized.
 
 sample:"0_1,1_-2,2_-2,3_-2,4_1,5_-2,6_-2,7_-2,8_1,9_1,10_-2,11_-2,12_-2,13_-2,14_-2,15_1,16_-2,17_-2,18_0,19_-2 浦东五莲路站，老饭店福瑞轩属于上海的本帮菜，交通方便，最近又重新装修，来拨草了，饭店活动满188元送50元钱，环境干净，简单。朋友提前一天来预订包房也没有订到，只有大堂，五点半到店基本上每个台子都客满了，都是附近居民，每道冷菜量都比以前小，味道还可以，热菜烤茄子，炒河虾仁，脆皮鸭，照牌鸡，小牛排，手撕腊味花菜等每道菜都很入味好吃，会员价划算，服务员人手太少，服务态度好，要能团购更好。可以用支付宝方便"
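To make the format concrete, here is a minimal parsing sketch in Python. It assumes each line is `labels<TAB>text`, optionally followed by a trailing tab, and it treats every label (such as `0_1`) as an opaque string; the README does not spell out what the two parts of a label mean. The helper names `parse_line`, `build_label_index`, and `to_multi_hot` are hypothetical and do not come from this repository, whose processor may parse the data differently.

```python
from typing import Dict, Iterable, List, Tuple


def parse_line(line: str) -> Tuple[List[str], str]:
    """Split one 'label1,label2<TAB>text' line into (labels, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated columns")
    labels = [label for label in fields[0].split(",") if label]
    return labels, fields[1]


def build_label_index(label_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign a stable index to every distinct label string."""
    unique = sorted({label for labels in label_lists for label in labels})
    return {label: i for i, label in enumerate(unique)}


def to_multi_hot(labels: List[str], label_to_index: Dict[str, int]) -> List[int]:
    """Encode a label list as a 0/1 vector with one slot per known label."""
    vector = [0] * len(label_to_index)
    for label in labels:
        vector[label_to_index[label]] = 1
    return vector


if __name__ == "__main__":
    labels, text = parse_line("0_1,1_-2\tsome review text\t\n")
    index = build_label_index([labels])
    print(labels, text, to_multi_hot(labels, index))
```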
 
 See the sample data in the ./BERT_BASE_DIR folder.

 For more detail, see create_model and SentimentAnalysisFineGrainProcessor in run_classifier.py
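A common way to train a multi-label classifier is to apply an independent sigmoid to each label's logit and average the per-label binary cross-entropy. The pure-Python sketch below illustrates that idea only. Whether create_model does exactly this should be checked in the source file, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_cross_entropy(logits: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over labels, in a numerically stable form.

    Uses max(x, 0) - x * z + log(1 + exp(-|x|)), which equals
    -z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x)).
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    losses = [
        max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        for x, z in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict_labels(logits: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Mark every label whose probability reaches the threshold (assumed 0.5)."""
    return [int(sigmoid(x) >= threshold) for x in logits]


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.3]
    targets = [1, 0, 1]
    print(sigmoid_cross_entropy(logits, targets), predict_labels(logits))
```

Because each label gets its own sigmoid, several labels can be active for the same input, which is what distinguishes this setup from a single softmax over classes.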

Pre-train the BERT model starting from the open-sourced model, then run the classification task

generate raw data: [ADD SOMETHING HERE]

Make sure each line is a sentence, with a blank line between documents.

You can find the generated data in the zip file.
```
 use write_pre_train_doc() from preprocess_char.ipynb 
```
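As a rough illustration of the layout described above (one sentence per line, one blank line between documents), here is a small Python sketch. It is not the notebook's write_pre_train_doc(); the helper names and the simple punctuation-based sentence splitting are assumptions made for this example.

```python
import re
from typing import Iterable, List

# Assumed sentence terminators; real data may need a more careful splitter.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(document: str) -> List[str]:
    """Split a document into sentences at common end-of-sentence marks."""
    return [s.strip() for s in _SENTENCE.findall(document) if s.strip()]


def format_pretrain_corpus(documents: Iterable[str]) -> str:
    """Return text with one sentence per line and a blank line between documents."""
    blocks = ["\n".join(split_sentences(doc)) for doc in documents]
    return "\n\n".join(block for block in blocks if block) + "\n"


if __name__ == "__main__":
    corpus = format_pretrain_corpus(["今天好。明天见！", "第二篇。"])
    # Hypothetical output path, chosen to match the bert_*_pretrain.txt pattern.
    with open("bert_sample_pretrain.txt", "w", encoding="utf-8") as f:
        f.write(corpus)
```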

Generate data for the pre-training stage using:

export BERT_BASE_DIR=./BERT_BASE_DIR/chinese_L-12_H-768_A-12
nohup python create_pretraining_data.py \
--input_file=./PRE_TRAIN_DIR/bert_*_pretrain.txt \
--output_file=./PRE_TRAIN_DIR/tf_examples.tfrecord \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=60 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5 > nohup_pre.out &
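
The flags masked_lm_prob and max_predictions_per_seq interact: the number of masked tokens in a sequence is the rounded product of the sequence length and masked_lm_prob, clamped between 1 and max_predictions_per_seq. The sketch below reproduces that arithmetic as it is understood from create_pretraining_data.py in google-research/bert; the function name `num_masked_positions` is hypothetical.

```python
def num_masked_positions(
    num_tokens: int,
    masked_lm_prob: float = 0.15,
    max_predictions_per_seq: int = 60,
) -> int:
    """Number of tokens selected for masked-LM prediction in one sequence."""
    wanted = int(round(num_tokens * masked_lm_prob))
    # At least one token is masked, and never more than the configured cap.
    return min(max_predictions_per_seq, max(1, wanted))


if __name__ == "__main__":
    for length in (3, 128, 512):
        print(length, num_masked_positions(length))
```

With the settings above, a full 512-token sequence would ask for about 77 masked positions, so the cap of 60 applies. The BERT README suggests setting max_predictions_per_seq to roughly max_seq_length * masked_lm_prob, so a larger value may be worth trying if memory allows.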

Pre-train the model with the generated data:

python run_pretraining.py

Fine-tuning:

python run_classifier.py

TextCNN

Download the <a href='https://pan.baidu.com/s/19aMHbPgfpBxz9sS-sYsjOg'>cache file of sentiment analysis(tokens are in word level)</a>, then train the model:

python train_cnn_fine_grain.py

 The cache file for the TextCNN model was generated by following the steps in preprocess_word.ipynb.

 It contains everything you need to run TextCNN.
 
 It includes the processed train/validation/test sets, the word vocabulary, and a dict that maps labels to indices.
 
 Take train_valid_test_vocab_cache.pik and put it in the preprocess_word/ folder.
 
 The raw data is also included in this zip file.

Pre-train TextCNN

Pre-train TextCNN with a masked language model:

python train_cnn_lm.py

Fine-tuning for TextCNN:

python train_cnn_fine_grain.py

Deploy BERT for online prediction

Using a TensorFlow session with a feed dict, you can easily deploy BERT.

<a href='https://github.com/brightmart/bert_language_understanding/blob/master/run_classifier_predict_online.py'>online prediction with BERT, check more from here</a>

Reference

<a href='https://arxiv.org/pdf/1810.04805.pdf'>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a>
<a href='https://github.com/google-research/bert'>google-research/bert</a>
<a href='https://github.com/pengshuang/AI-Comp'>pengshuang/AI-Comp</a>
<a href='https://github.com/AIChallenger/AI_Challenger_2018'>AI Challenger 2018</a>
<a href='https://arxiv.org/abs/1408.5882'>Convolutional Neural Networks for Sentence Classification</a>