# BiT

This repository contains the training code for BiT, introduced in our work "BiT: Robustly Binarized Multi-distilled Transformer".

In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a multi-step distillation method. Together, these approaches yield, for the first time, fully binarized transformer models at a practical level of accuracy, approaching the full-precision BERT baseline on the GLUE language understanding benchmark to within as little as 5.9%.

<div align=center> <img width=60% src="https://github.com/facebookresearch/bit/blob/main/overview.jpg"/> </div>
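
The elastic binary activation can be sketched as a custom autograd function with a learnable scale and threshold. The snippet below is a minimal PyTorch illustration of the idea described above, not the repository's actual implementation; the class name `ElasticBinarization`, the straight-through-estimator (STE) gradient details, and the analytic scale in `binarize_signed` are assumptions made for exposition.

```python
import torch

class ElasticBinarization(torch.autograd.Function):
    """Illustrative elastic binarization for the non-negative {0, alpha} set
    (e.g. softmax or post-ReLU activations), with a learnable scale `alpha`
    and threshold `beta`. Gradients use a straight-through estimator (STE).
    This is a sketch, not the repo's actual code."""

    @staticmethod
    def forward(ctx, x, alpha, beta):
        x_s = (x - beta) / alpha                        # shift and rescale
        x_b = torch.round(torch.clamp(x_s, 0.0, 1.0))   # binarize into {0, 1}
        ctx.save_for_backward(x_s, x_b)
        return alpha * x_b                              # output in {0, alpha}

    @staticmethod
    def backward(ctx, grad_out):
        x_s, x_b = ctx.saved_tensors
        inside = ((x_s > 0) & (x_s < 1)).to(grad_out.dtype)
        grad_x = grad_out * inside                      # STE inside the clip range
        grad_alpha = (grad_out * (x_b - inside * x_s)).sum()  # LSQ-style scale gradient
        grad_beta = -(grad_out * inside).sum()          # beta shifts x before scaling
        return grad_x, grad_alpha, grad_beta


def binarize_signed(x):
    """Two-set counterpart for weights and sign-symmetric activations:
    {-alpha, +alpha}, with the scale set to the mean absolute value."""
    alpha = x.abs().mean()
    return alpha * torch.sign(x)


# Usage: alpha and beta would be per-layer learnable parameters.
x = torch.randn(4, 8, requires_grad=True)
alpha = torch.tensor(1.0, requires_grad=True)
beta = torch.tensor(0.0, requires_grad=True)
y = ElasticBinarization.apply(x, alpha, beta)
y.sum().backward()
```

In the two-set scheme, `alpha` and `beta` are learned jointly with the network weights; the {0, alpha} set suits naturally non-negative activations such as softmax outputs, while the signed set covers weights and activations that take both signs.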

## Citation

If you find our code useful for your research, please consider citing:

```
@article{liu2022bit,
  title={BiT: Robustly Binarized Multi-distilled Transformer},
  author={Liu, Zechun and Oguz, Barlas and Pappu, Aasish and Xiao, Lin and Yih, Scott and Li, Meng and Krishnamoorthi, Raghuraman and Mehdad, Yashar},
  journal={arXiv preprint arXiv:2205.13016},
  year={2022}
}
```

## Run

1. Requirements:

2. Data: the GLUE benchmark and SQuAD v1.1 (a download sketch follows this list)

3. Pretrained models:

4. Steps to run:
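
As referenced in the data item above, here is a minimal sketch of fetching the evaluation data with the Hugging Face `datasets` library; this is an assumption for illustration, since the repository (which builds on BinaryBERT) may ship its own data-preparation scripts.

```python
# Hypothetical data-fetching sketch; the repo may provide its own scripts.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")   # one of the GLUE tasks evaluated below
squad = load_dataset("squad")         # SQuAD v1.1 for the QA results
print(mnli["train"][0])               # inspect a training example
```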

## Models

### 1. GLUE dataset

(1) Without data augmentation

| Method | #Bits | Size (M) | FLOPs (G) | MNLI m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 32-32-32 | 418 | 22.5 | 84.9/85.5 | 91.4 | 92.1 | 93.2 | 59.7 | 90.1 | 86.3 | 72.2 | 83.9 |
| BinaryBERT | 1-1-4 | 16.5 | 1.5 | 83.9/84.2 | 91.2 | 90.9 | 92.3 | 44.4 | 87.2 | 83.3 | 65.3 | 79.9 |
| BinaryBERT | 1-1-2 | 16.5 | 0.8 | 62.7/63.9 | 79.9 | 52.6 | 82.5 | 14.6 | 6.5 | 68.3 | 52.7 | 53.7 |
| BinaryBERT | 1-1-1 | 16.5 | 0.4 | 35.6/35.3 | 66.2 | 51.5 | 53.2 | 0 | 6.1 | 68.3 | 52.7 | 41.0 |
| BiBERT | 1-1-1 | 13.4 | 0.4 | 66.1/67.5 | 84.8 | 72.6 | 88.7 | 25.4 | 33.6 | 72.5 | 57.4 | 63.2 |
| BiT * | 1-1-4 | 13.4 | 1.5 | 83.6/84.4 | 87.8 | 91.3 | 91.5 | 42.0 | 86.3 | 86.8 | 66.4 | 79.5 |
| BiT * | 1-1-2 | 13.4 | 0.8 | 82.1/82.5 | 87.1 | 89.3 | 90.8 | 32.1 | 82.2 | 78.4 | 58.1 | 75.0 |
| BiT * | 1-1-1 | 13.4 | 0.4 | 77.1/77.5 | 82.9 | 85.7 | 87.7 | 25.1 | 71.1 | 79.7 | 58.8 | 71.0 |
| BiT | 1-1-1 | 13.4 | 0.4 | 79.5/79.4 | 85.4 | 86.4 | 89.9 | 32.9 | 72.0 | 79.9 | 62.1 | 73.5 |

(2) With data augmentation

| Method | #Bits | Size (M) | FLOPs (G) | MNLI m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BinaryBERT | 1-1-2 | 16.5 | 0.8 | 62.7/63.9\* | 79.9\* | 51.0 | 89.6 | 33.0 | 11.4 | 71.0 | 55.9 | 57.6 |
| BinaryBERT | 1-1-1 | 16.5 | 0.4 | 35.6/35.3\* | 66.2\* | 66.1 | 78.3 | 7.3 | 22.1 | 69.3 | 57.7 | 48.7 |
| BiBERT | 1-1-1 | 13.4 | 0.4 | 66.1/67.5\* | 84.8\* | 76.0 | 90.9 | 37.8 | 56.7 | 78.8 | 61.0 | 68.8 |
| BiT * | 1-1-2 | 13.4 | 0.8 | 82.1/82.5\* | 87.1\* | 88.8 | 92.5 | 43.2 | 86.3 | 90.4 | 72.9 | 80.4 |
| BiT * | 1-1-1 | 13.4 | 0.4 | 77.1/77.5\* | 82.9\* | 85.0 | 91.5 | 32.0 | 84.1 | 88.0 | 67.5 | 76.0 |
| BiT | 1-1-1 | 13.4 | 0.4 | 79.5/79.4\* | 85.4\* | 86.5 | 92.3 | 38.2 | 84.2 | 88.0 | 69.7 | 78.0 |

\* MNLI and QQP numbers are carried over from the table without data augmentation above, as data augmentation is not applied to these two tasks.

### 2. SQuAD dataset

| Method | #Bits | SQuAD v1.1 (EM/F1) |
|---|---|---|
| BERT | 32-32-32 | 82.6/89.7 |
| BinaryBERT | 1-1-4 | 77.9/85.8 |
| BinaryBERT | 1-1-2 | 72.3/81.8 |
| BinaryBERT | 1-1-1 | 1.5/8.2 |
| BiBERT | 1-1-1 | 8.5/18.9 |
| BiT | 1-1-1 | 63.1/74.9 |

## Acknowledgement

Our code is adapted from the BinaryBERT codebase.

## Contact

Zechun Liu, Reality Labs, Meta Inc. (liuzechun0216 at gmail.com)

## License

BiT is currently released under the CC-BY-NC 4.0 license.