
Training Tips for the Transformer Model

This article is a summary of Training Tips for the Transformer Model and Advanced Techniques for Fine-Tuning Transformers.
  1. Training data preprocessing
    1. A higher batch size may be beneficial for training, and the batch size can be increased when training sentences longer than a given threshold are excluded. It is therefore a good idea to filter out overly long sentences (a filtering sketch follows this list).
  2. Training data size
    1. When comparing different datasets (e.g. smaller and cleaner vs. bigger and noisier), train long enough, because results after the first hours (or days, if training on a single GPU) may be misleading.
    2. For large training data, BLEU improves even after one week of training on eight GPUs.
      1. BLEU is an NLP metric for evaluating machine translation quality (a sacreBLEU usage sketch follows this list).
  3. Model size
    1. Prefer the BIG over the BASE model if you plan to train longer than one day and have 11 GB (or more) of GPU memory available.
    2. With less memory, benchmark BIG and BASE with the maximum possible batch size (the two configurations are compared in a sketch after this list).
  4. Max_length
    1. Set a reasonably low max_length: this allows a higher batch size and prevents out-of-memory errors after several hours of training.
    2. At the same time, set a reasonably high max_length, so that only a small fraction of the training sentences is excluded (a sketch for choosing the threshold from the length distribution follows this list).
  5. Batch size
    1. Set the batch size as high as possible while keeping a reserve, so that out-of-memory (OOM) errors are not hit (a probing sketch follows this list).
  6. Learning rate and Warmup steps
    1. In case of diverged training, try gradient clipping and/or more warmup steps.
    2. If that does not help (or if the warmup steps are too high relative to the expected total training steps), try decreasing the learning rate.
    3. Note that when you decrease the warmup steps (and keep the learning rate parameter), you also increase the maximum actual learning rate reached by the schedule (see the warmup-schedule sketch after this list).
  7. Number of GPUS
    1. For the fastest BLEU convergence, use as many GPUs as possible.
    2. Keep the learning rate parameter at its optimal value found in single-GPU experiments (a data-parallel sketch follows this list).
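
The length-based filtering from the preprocessing item can be done with a few lines of Python. This is a minimal sketch assuming plain-text, line-aligned source/target files and whitespace token counts; the file names and the 70-token threshold are illustrative assumptions, not values from the article.

```python
# Minimal sketch: drop sentence pairs where either side exceeds a length threshold.
# File names and the 70-token threshold are illustrative assumptions.
MAX_TOKENS = 70

def filter_long_pairs(src_path, tgt_path, out_src_path, out_tgt_path, max_tokens=MAX_TOKENS):
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src_in, \
         open(tgt_path, encoding="utf-8") as tgt_in, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for src_line, tgt_line in zip(src_in, tgt_in):
            # Whitespace tokenization is only a proxy; subword length is what
            # actually determines memory use, but the idea is the same.
            if len(src_line.split()) > max_tokens or len(tgt_line.split()) > max_tokens:
                dropped += 1
                continue
            src_out.write(src_line)
            tgt_out.write(tgt_line)
            kept += 1
    print(f"kept {kept} pairs, dropped {dropped} overly long pairs")

if __name__ == "__main__":
    filter_long_pairs("train.src", "train.tgt", "train.filtered.src", "train.filtered.tgt")
```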
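
BLEU scores such as the ones tracked during long training runs can be computed with the sacrebleu package. This is a minimal usage sketch; the hypothesis and reference sentences are made up.

```python
# Minimal sketch of corpus-level BLEU with sacrebleu (pip install sacrebleu).
# The hypothesis/reference sentences below are made-up examples.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "transformers are trained on eight GPUs",
]
references = [
    "the cat sat on the mat",
    "the transformer is trained on eight GPUs",
]

# corpus_bleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```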
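
For the model-size item, the BASE and BIG configurations differ mainly in model width, feed-forward size, attention heads and dropout. The values below are the ones reported in the original "Attention Is All You Need" paper, shown as plain dictionaries rather than any particular framework's config object; the pick_model helper is only an illustration of the rule of thumb from the article.

```python
# Transformer BASE vs BIG hyperparameters as reported in "Attention Is All You Need".
# Plain dictionaries for comparison only; adapt to your framework's config format.
TRANSFORMER_BASE = {
    "hidden_size": 512,        # d_model
    "filter_size": 2048,       # feed-forward inner dimension
    "num_heads": 8,
    "num_layers": 6,           # encoder and decoder layers
    "dropout": 0.1,
    "params_millions": 65,     # approximate parameter count
}

TRANSFORMER_BIG = {
    "hidden_size": 1024,
    "filter_size": 4096,
    "num_heads": 16,
    "num_layers": 6,
    "dropout": 0.3,            # as listed for the big EN-DE model in the paper
    "params_millions": 213,
}

def pick_model(gpu_memory_gb, planned_training_days):
    """Rule of thumb from the article: prefer BIG when training longer than a day
    with at least 11 GB of GPU memory; otherwise benchmark both."""
    if planned_training_days > 1 and gpu_memory_gb >= 11:
        return "big"
    return "benchmark both BASE and BIG with the maximum batch size that fits"

print(pick_model(gpu_memory_gb=11, planned_training_days=3))
```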
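
One way to balance a "reasonably low" against a "reasonably high" max_length is to look at the length distribution of the (ideally subword-tokenized) training data and pick a high percentile as the threshold. The 99th percentile and the example lengths below are illustrative assumptions.

```python
# Sketch: choose max_length as a high percentile of training sentence lengths,
# so almost all sentences fit while extreme outliers are excluded.
# The 99th percentile is an illustrative choice, not a value from the article.
import numpy as np

def suggest_max_length(token_lengths, percentile=99):
    """token_lengths: iterable of per-sentence token counts (ideally subword counts)."""
    return int(np.percentile(list(token_lengths), percentile))

# Example with made-up lengths:
lengths = [12, 25, 31, 18, 44, 70, 23, 150, 9, 36]
print("suggested max_length:", suggest_max_length(lengths))
```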
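
A common way to find the "as high as possible with a reserve" batch size is to probe with dummy batches until a CUDA out-of-memory error occurs and then back off by a safety margin. This is a PyTorch sketch; the toy model, sequence length and the 20% reserve are assumptions, not figures from the article.

```python
# Sketch: probe for the largest batch size that fits on the GPU, then keep a reserve.
# The toy model, sequence length and the 0.8 safety factor are assumptions.
import torch
import torch.nn as nn

def max_batch_size(model, seq_len=100, vocab_size=32000, start=8, device="cuda"):
    """Double the batch size until a CUDA OOM is raised; return the last size that worked."""
    model = model.to(device)
    batch = start
    last_ok = 0
    while True:
        try:
            x = torch.randint(0, vocab_size, (batch, seq_len), device=device)
            loss = model(x).mean()
            loss.backward()                      # include backward, it dominates memory
            model.zero_grad(set_to_none=True)
            last_ok = batch
            batch *= 2
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()
            return last_ok

if __name__ == "__main__" and torch.cuda.is_available():
    toy = nn.Sequential(nn.Embedding(32000, 512), nn.Linear(512, 512))
    largest = max_batch_size(toy)
    safe = int(largest * 0.8)                    # keep ~20% reserve against later OOMs
    print(f"largest working batch: {largest}, recommended with reserve: {safe}")
```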
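
The interplay of learning rate, warmup steps and gradient clipping can be seen in the inverse-square-root ("Noam") schedule used for the Transformer, whose peak learning rate rises when warmup_steps is lowered. This is a minimal PyTorch sketch; the stand-in model, the warmup value and the clipping threshold are illustrative assumptions.

```python
# Sketch: Transformer-style inverse-square-root warmup schedule plus gradient clipping.
# Model, optimizer and hyperparameter values are illustrative.
import torch
import torch.nn as nn

D_MODEL = 512
WARMUP_STEPS = 16000      # fewer warmup steps => higher peak learning rate
CLIP_NORM = 1.0           # gradient clipping threshold (assumption)

def noam_lr(step, d_model=D_MODEL, warmup_steps=WARMUP_STEPS):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    The peak, reached at step == warmup_steps, is d_model^-0.5 * warmup_steps^-0.5,
    which is why lowering warmup_steps raises the maximum actual learning rate."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = nn.Linear(D_MODEL, D_MODEL)             # stand-in for the real Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 101):                      # toy training loop with random data
    x = torch.randn(32, D_MODEL)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)   # clip before the step
    optimizer.step()
    scheduler.step()
```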
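
For the multi-GPU point, the simplest illustration is wrapping the model for data-parallel training while leaving the learning rate at its single-GPU optimum. This sketch uses torch.nn.DataParallel for brevity (DistributedDataParallel is usually preferred for real runs); the toy model and learning rate value are placeholders, not values from the article.

```python
# Sketch: use all available GPUs via data parallelism while keeping the
# learning rate at the value tuned in single-GPU experiments.
# The toy model and learning rate value are placeholders.
import torch
import torch.nn as nn

SINGLE_GPU_BEST_LR = 2e-4        # placeholder for the value found on one GPU

model = nn.Linear(512, 512)
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across GPUs; the effective batch per step
    # grows, but the learning rate parameter itself is left unchanged.
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

optimizer = torch.optim.Adam(model.parameters(), lr=SINGLE_GPU_BEST_LR)
print(f"training on {max(torch.cuda.device_count(), 1)} GPU(s), lr={SINGLE_GPU_BEST_LR}")
```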