
Training Tips for the Transformer Model

This article is a summary of Training Tips for the Transformer Model and Advanced Techniques for Fine-Tuning Transformers.
  1. Training data preprocessing
    1. A higher batch size may be beneficial for training, and the batch size can be increased when training sentences longer than a given threshold are excluded. It is therefore a good idea to filter out overly long sentences (a filtering sketch follows this list).
  2. Training data size
    1. When comparing different datasets (e.g. smaller and cleaner vs. bigger and noisier), train long enough, because results after the first hours (or days, if training on a single GPU) may be misleading.
    2. For large training data, BLEU improves even after one week of training on eight GPUs.
      1. BLEU is an NLP metric for evaluating machine translation quality (a sacreBLEU usage sketch follows this list).
  3. Model size
    1. Prefer the BIG over the BASE model if you plan to train longer than one day and have 11 GB (or more) of GPU memory available.
    2. With less memory, benchmark BIG and BASE with the maximum possible batch size (the two configurations are compared in a sketch after this list).
  4. Max_length
    1. Set a reasonably low max_length: this allows a higher batch size and prevents out-of-memory errors after several hours of training.
    2. At the same time, set a reasonably high max_length, so that only a small fraction of the training sentences is excluded (a sketch for choosing the threshold from the length distribution follows this list).
  5. Batch size
    1. Set the batch size as high as possible while keeping a reserve, so that out-of-memory (OOM) errors are not hit (a probing sketch follows this list).
  6. Learning rate and Warmup steps
    1. In case of diverged training, try gradient clipping and/or more warmup steps.
    2. If that does not help (or if the warmup steps are too high relative to the expected total training steps), try decreasing the learning rate.
    3. Note that when you decrease the warmup steps (and keep the learning rate parameter), you also increase the maximum actual learning rate reached by the schedule (see the warmup-schedule sketch after this list).
  7. Number of GPUS
    1. For the fastest BLEU convergence, use as many GPUs as possible.
    2. Keep the learning rate parameter at its optimal value found in single-GPU experiments (a data-parallel sketch follows this list).
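
The length-based filtering from the preprocessing item can be done with a few lines of Python. This is a minimal sketch assuming plain-text, line-aligned source/target files and whitespace token counts; the file names and the 70-token threshold are illustrative assumptions, not values from the article.

```python
# Minimal sketch: drop sentence pairs where either side exceeds a length threshold.
# File names and the 70-token threshold are illustrative assumptions.
MAX_TOKENS = 70

def filter_long_pairs(src_path, tgt_path, out_src_path, out_tgt_path, max_tokens=MAX_TOKENS):
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src_in, \
         open(tgt_path, encoding="utf-8") as tgt_in, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for src_line, tgt_line in zip(src_in, tgt_in):
            # Whitespace tokenization is only a proxy; subword length is what
            # actually determines memory use, but the idea is the same.
            if len(src_line.split()) > max_tokens or len(tgt_line.split()) > max_tokens:
                dropped += 1
                continue
            src_out.write(src_line)
            tgt_out.write(tgt_line)
            kept += 1
    print(f"kept {kept} pairs, dropped {dropped} overly long pairs")

if __name__ == "__main__":
    filter_long_pairs("train.src", "train.tgt", "train.filtered.src", "train.filtered.tgt")
```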
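
BLEU scores such as the ones tracked during long training runs can be computed with the sacrebleu package. This is a minimal usage sketch; the hypothesis and reference sentences are made up.

```python
# Minimal sketch of corpus-level BLEU with sacrebleu (pip install sacrebleu).
# The hypothesis/reference sentences below are made-up examples.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "transformers are trained on eight GPUs",
]
references = [
    "the cat sat on the mat",
    "the transformer is trained on eight GPUs",
]

# corpus_bleu expects a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```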
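
For the model-size item, the BASE and BIG configurations differ mainly in model width, feed-forward size, attention heads and dropout. The values below are the ones reported in the original "Attention Is All You Need" paper, shown as plain dictionaries rather than any particular framework's config object; the pick_model helper is only an illustration of the rule of thumb from the article.

```python
# Transformer BASE vs BIG hyperparameters as reported in "Attention Is All You Need".
# Plain dictionaries for comparison only; adapt to your framework's config format.
TRANSFORMER_BASE = {
    "hidden_size": 512,        # d_model
    "filter_size": 2048,       # feed-forward inner dimension
    "num_heads": 8,
    "num_layers": 6,           # encoder and decoder layers
    "dropout": 0.1,
    "params_millions": 65,     # approximate parameter count
}

TRANSFORMER_BIG = {
    "hidden_size": 1024,
    "filter_size": 4096,
    "num_heads": 16,
    "num_layers": 6,
    "dropout": 0.3,            # as listed for the big EN-DE model in the paper
    "params_millions": 213,
}

def pick_model(gpu_memory_gb, planned_training_days):
    """Rule of thumb from the article: prefer BIG when training longer than a day
    with at least 11 GB of GPU memory; otherwise benchmark both."""
    if planned_training_days > 1 and gpu_memory_gb >= 11:
        return "big"
    return "benchmark both BASE and BIG with the maximum batch size that fits"

print(pick_model(gpu_memory_gb=11, planned_training_days=3))
```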
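
One way to balance a "reasonably low" against a "reasonably high" max_length is to look at the length distribution of the (ideally subword-tokenized) training data and pick a high percentile as the threshold. The 99th percentile and the example lengths below are illustrative assumptions.

```python
# Sketch: choose max_length as a high percentile of training sentence lengths,
# so almost all sentences fit while extreme outliers are excluded.
# The 99th percentile is an illustrative choice, not a value from the article.
import numpy as np

def suggest_max_length(token_lengths, percentile=99):
    """token_lengths: iterable of per-sentence token counts (ideally subword counts)."""
    return int(np.percentile(list(token_lengths), percentile))

# Example with made-up lengths:
lengths = [12, 25, 31, 18, 44, 70, 23, 150, 9, 36]
print("suggested max_length:", suggest_max_length(lengths))
```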
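
A common way to find the "as high as possible with a reserve" batch size is to probe with dummy batches until a CUDA out-of-memory error occurs and then back off by a safety margin. This is a PyTorch sketch; the toy model, sequence length and the 20% reserve are assumptions, not figures from the article.

```python
# Sketch: probe for the largest batch size that fits on the GPU, then keep a reserve.
# The toy model, sequence length and the 0.8 safety factor are assumptions.
import torch
import torch.nn as nn

def max_batch_size(model, seq_len=100, vocab_size=32000, start=8, device="cuda"):
    """Double the batch size until a CUDA OOM is raised; return the last size that worked."""
    model = model.to(device)
    batch = start
    last_ok = 0
    while True:
        try:
            x = torch.randint(0, vocab_size, (batch, seq_len), device=device)
            loss = model(x).mean()
            loss.backward()                      # include backward, it dominates memory
            model.zero_grad(set_to_none=True)
            last_ok = batch
            batch *= 2
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            torch.cuda.empty_cache()
            return last_ok

if __name__ == "__main__" and torch.cuda.is_available():
    toy = nn.Sequential(nn.Embedding(32000, 512), nn.Linear(512, 512))
    largest = max_batch_size(toy)
    safe = int(largest * 0.8)                    # keep ~20% reserve against later OOMs
    print(f"largest working batch: {largest}, recommended with reserve: {safe}")
```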
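
The interplay of learning rate, warmup steps and gradient clipping can be seen in the inverse-square-root ("Noam") schedule used for the Transformer, whose peak learning rate rises when warmup_steps is lowered. This is a minimal PyTorch sketch; the stand-in model, the warmup value and the clipping threshold are illustrative assumptions.

```python
# Sketch: Transformer-style inverse-square-root warmup schedule plus gradient clipping.
# Model, optimizer and hyperparameter values are illustrative.
import torch
import torch.nn as nn

D_MODEL = 512
WARMUP_STEPS = 16000      # fewer warmup steps => higher peak learning rate
CLIP_NORM = 1.0           # gradient clipping threshold (assumption)

def noam_lr(step, d_model=D_MODEL, warmup_steps=WARMUP_STEPS):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    The peak, reached at step == warmup_steps, is d_model^-0.5 * warmup_steps^-0.5,
    which is why lowering warmup_steps raises the maximum actual learning rate."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = nn.Linear(D_MODEL, D_MODEL)             # stand-in for the real Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1, 101):                      # toy training loop with random data
    x = torch.randn(32, D_MODEL)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)   # clip before the step
    optimizer.step()
    scheduler.step()
```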
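
For the multi-GPU point, the simplest illustration is wrapping the model for data-parallel training while leaving the learning rate at its single-GPU optimum. This sketch uses torch.nn.DataParallel for brevity (DistributedDataParallel is usually preferred for real runs); the toy model and learning rate value are placeholders, not values from the article.

```python
# Sketch: use all available GPUs via data parallelism while keeping the
# learning rate at the value tuned in single-GPU experiments.
# The toy model and learning rate value are placeholders.
import torch
import torch.nn as nn

SINGLE_GPU_BEST_LR = 2e-4        # placeholder for the value found on one GPU

model = nn.Linear(512, 512)
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across GPUs; the effective batch per step
    # grows, but the learning rate parameter itself is left unchanged.
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

optimizer = torch.optim.Adam(model.parameters(), lr=SINGLE_GPU_BEST_LR)
print(f"training on {max(torch.cuda.device_count(), 1)} GPU(s), lr={SINGLE_GPU_BEST_LR}")
```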