FastGECToR
1. Introduction
A faster and simpler implementation of GECToR – Grammatical Error Correction: Tag, Not Rewrite [1], with AMP (automatic mixed precision) and distributed training support via DeepSpeed.
Note: To make the code faster and more readable, we removed the allennlp dependency and rewrote the related code.
2. Requirements
- Install PyTorch with CUDA support
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
- Install NVIDIA-Apex with CUDA and C++ extensions
- Install the remaining packages with
```bash
pip install -r ./requirements.txt
```
3. Data Processing
- Tokenize your data (one sentence per line, words separated by spaces); see the format sketch after this list
- Generate edits from parallel sentences with
```bash
bash scripts/prepare_data.sh
```
- (Optional) Define your own target vocab (data/vocabulary/labels.txt)
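For illustration, here is a minimal sketch of the expected input format. The file names are hypothetical; the only requirement is one whitespace-tokenized sentence per line, with the source and target files aligned line by line.

```python
# Hypothetical file names; each line is one sentence with words separated by
# spaces, and source/target files are aligned line by line.
src_sents = ["He go to school yesterday .", "She like apples ."]
tgt_sents = ["He went to school yesterday .", "She likes apples ."]

with open("source.txt", "w") as f_src, open("target.txt", "w") as f_tgt:
    f_src.write("\n".join(src_sents) + "\n")
    f_tgt.write("\n".join(tgt_sents) + "\n")
```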
4. Configuration
- We use DeepSpeed configs to support distributed training and fp16/bf16 data types. Please refer to the DeepSpeed config JSON documentation for more details.
- We use max_num_tokens to limit the maximum length of a sequence after tokenization, instead of a word-level max_len, because word-level length is not aligned with the length of input_ids and can be misleading (see the sketch after this list).
- Use --segmented 1 if your input file has spaces between words; if it is set to 0, the input will be split by character.
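To see why a word-level max_len can be misleading, here is a small sketch using a Hugging Face tokenizer (the transformers package and bert-base-uncased are assumptions for illustration, not requirements of the scripts):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "The hyperparameters were misconfigured ."
num_words = len(sentence.split())                   # word-level length
num_tokens = len(tokenizer(sentence)["input_ids"])  # subword-level length, incl. [CLS]/[SEP]

# Rare words are split into several subwords, so the number of input_ids is
# larger than the number of words; max_num_tokens therefore limits the
# sequence length *after* tokenization.
print(num_words, num_tokens)
```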
5. Training
- Edit deepspeed_config.json according to your configuration; lr, train_batch_size, and gradient_accumulation_steps will be inherited from the DeepSpeed config file (a config sketch follows the command below).
```bash
bash scripts/train.sh
```
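As a rough sketch of the relevant part of such a config (the exact keys and values in this repo's deepspeed_config.json may differ), expressed here as the equivalent Python dict from the DeepSpeed docs:

```python
# Placeholder values; DeepSpeed reads lr, train_batch_size and
# gradient_accumulation_steps from the config rather than from script flags.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "fp16": {"enabled": False},
}

# Inside the training script (normally launched via scripts/train.sh),
# DeepSpeed wraps the model roughly like this:
# import deepspeed
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```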
* Performance Tuning
- Suppose you want to train a GECToR model with bert-base-uncased (110M params), with the max sequence length set to 256 in all cases. There are several configurations you may need to consider in order to achieve better performance and efficiency.
- The basic configuration uses a single GPU without any tricks; you may then get the following statistics.

| global batch size | n_gpus | MaxMemAllocated (CUDA) | GPU Mem Usage (NVIDIA-SMI) |
|---|---|---|---|
| 8 | 1 | 3.3GB | 5880MiB |
| 16 | 1 | 5.33GB | 7610MiB |
| 32 | 1 | 9.28GB | 11712MiB |
| 64 | 1 | 17.28GB | 20344MiB |
| 128 | 1 | 33.25GB | 36654MiB |
| 256 | 1 | 65.21GB | 69864MiB |
- As you can see, the maximum batch size you can set is limited by GPU memory. The simplest way to reach a larger effective batch size is gradient accumulation, which accumulates gradients over several steps and updates the weights at a fixed interval, greatly reducing memory usage (see the sketch after the table).

| global batch size | effective batch size | gradient accumulation steps | n_gpus | MaxMemAllocated (CUDA) | GPU Mem Usage (NVIDIA-SMI) |
|---|---|---|---|---|---|
| 256 | 256 | 1 | 1 | 65.21GB | 69864MiB |
| 256 | 128 | 2 | 1 | 33.68GB | 36654MiB |
| 256 | 64 | 4 | 1 | 17.71GB | 20152MiB |
| 256 | 32 | 8 | 1 | 9.7GB | 12344MiB |
| 256 | 16 | 16 | 1 | 5.76GB | 8018MiB |
| 256 | 8 | 32 | 1 | 3.72GB | 5872MiB |
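Conceptually, gradient accumulation works as in the sketch below (plain PyTorch for clarity; with DeepSpeed the engine performs this automatically based on gradient_accumulation_steps in the config):

```python
import torch

accumulation_steps = 4  # hypothetical value: one update per 4 micro-batches

model = torch.nn.Linear(16, 2)  # stand-in for the real GECToR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Fake micro-batches standing in for the real dataloader.
batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the update averages
    loss.backward()                                   # gradients accumulate in .grad
    if step % accumulation_steps == 0:
        optimizer.step()       # one weight update per accumulation_steps micro-batches
        optimizer.zero_grad()
```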
- Another way to train with a large batch size is data parallelism, which replicates the model and splits each data batch across DP ranks, reducing the memory consumed per GPU.

| global batch size | n_gpus | MaxMemAllocated (CUDA) | Per GPU Mem Usage (NVIDIA-SMI) |
|---|---|---|---|
| 256 | 1 | 65.21GB | 69864MiB |
| 256 | 2 | 33.25GB | 37038MiB |
| 256 | 4 | 17.28GB | 21160MiB |
| 256 | 8 | 9.28GB | 12616MiB |
- It is also possible to reduce memory usage further. For example, you can train with FP16 for better efficiency at the cost of lower precision. DeepSpeed's ZeRO optimizations can also be used, alone or combined with the above, in distributed training. Note that for small models, higher ZeRO stages may not help; in most cases ZeRO-1 (optimizer state partitioning) is enough (a config sketch follows the table).

| global batch size | n_gpus | use fp16 | use zero1 | MaxMemAllocated (CUDA) | Per GPU Mem Usage (NVIDIA-SMI) |
|---|---|---|---|---|---|
| 256 | 1 | False | False | 65.21GB | 69864MiB |
| 256 | 1 | True | False | 35.18GB | 38594MiB |
| 256 | 8 | False | False | 9.28GB | 12616MiB |
| 256 | 8 | True | False | 5.71GB | 9066MiB |
| 256 | 8 | False | True | 8.59GB | 12172MiB |
| 256 | 8 | True | True | 4.64GB | 7610MiB |
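For reference, fp16 and ZeRO stage 1 are enabled through the standard DeepSpeed config keys, roughly as below (shown as the equivalent Python dict; the corresponding entries go into deepspeed_config.json):

```python
# Standard DeepSpeed keys for mixed precision and ZeRO-1 (optimizer state
# partitioning); the values here are just examples.
# bf16 is configured analogously via {"bf16": {"enabled": True}}.
zero_fp16_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}
```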
- There are other strategies to maximize hardware utilization and get better performance. Check https://www.deepspeed.ai/ for more details.
6. Inference
- Edit deepspeed_config.json according to your configuration, then run
```bash
bash scripts/predict.sh
```
Reference
[1] Omelianchuk, K., Atrasevych, V., Chernodub, A., & Skurzhanskyi, O. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. arXiv:2005.12592 [cs]. http://arxiv.org/abs/2005.12592