A higher batch size may be beneficial for training, and the batch size can be increased by excluding training sentences longer than a given threshold. It is therefore a good idea to exclude overly long sentences.
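For illustration, a minimal sketch of such length-based filtering of a parallel corpus follows; the 70-token threshold and the file names are illustrative assumptions, not values prescribed here.

```python
# Hedged sketch: drop sentence pairs whose source or target side exceeds a
# chosen token limit. Threshold and file names are illustrative assumptions.
MAX_TOKENS = 70

def keep_pair(src_line: str, tgt_line: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if both sides have at most max_tokens whitespace-separated tokens."""
    return len(src_line.split()) <= max_tokens and len(tgt_line.split()) <= max_tokens

with open("train.src") as src, open("train.tgt") as tgt, \
     open("train.filtered.src", "w") as out_src, open("train.filtered.tgt", "w") as out_tgt:
    for s, t in zip(src, tgt):
        if keep_pair(s, t):
            out_src.write(s)
            out_tgt.write(t)
```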
Training data size
When comparing different datasets (e.g. smaller and cleaner vs. bigger and noisier), we need to train long enough, because results after the first hours (or days, if training on a single GPU) may be misleading.
For large training data, BLEU improves even after one week of training on eight GPUs.
BLEU is an NLP evaluation metric for machine translation quality.
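As a side note, corpus-level BLEU can be computed, for example, with the sacrebleu package, as in the minimal sketch below; the toy hypothesis and reference are illustrative only.

```python
# Hedged sketch: corpus-level BLEU with sacrebleu (assumed to be installed).
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one list per set of references

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```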
Model size
Prefer the BIG over the BASE model if you plan to train for longer than one day and have 11 GB (or more) of memory available on the GPU.
With less memory you should benchmark BIG and BASE with the maximum possible batch size.
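As a rough orientation, the sketch below estimates the parameter counts of the two configurations from their published dimensions (d_model, d_ff, number of layers); it ignores biases, layer norms and the output softmax, and the 32k vocabulary size is an assumption.

```python
# Hedged sketch: rough parameter-count comparison of the BASE and BIG
# Transformer configurations (dimensions from "Attention Is All You Need").
def approx_params(d_model: int, d_ff: int, layers: int, vocab: int = 32000) -> int:
    encoder_layer = 4 * d_model**2 + 2 * d_model * d_ff   # self-attention + feed-forward
    decoder_layer = 8 * d_model**2 + 2 * d_model * d_ff   # self-attn + cross-attn + feed-forward
    embeddings = vocab * d_model                           # assumes one shared embedding matrix
    return layers * (encoder_layer + decoder_layer) + embeddings

print(f"BASE ~ {approx_params(512, 2048, 6) / 1e6:.0f}M parameters")
print(f"BIG  ~ {approx_params(1024, 4096, 6) / 1e6:.0f}M parameters")
```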
Max_length
Set a reasonably low max_length: this allows a higher batch size and prevents out-of-memory errors after several hours of training. At the same time, do not set it so low that a large fraction of the training sentences is excluded.
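A quick way to choose the threshold is to measure what fraction of the training data each candidate max_length would exclude, as in the sketch below; the file name and the candidate thresholds are illustrative assumptions.

```python
# Hedged sketch: fraction of training sentences excluded by candidate max_length values.
with open("train.src") as f:
    lengths = [len(line.split()) for line in f]

for max_length in (50, 70, 100, 150):
    excluded = sum(1 for n in lengths if n > max_length)
    print(f"max_length={max_length}: {100 * excluded / len(lengths):.2f}% of sentences excluded")
```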
Batch size
The batch size should be set as high as possible, while keeping a reserve so that out-of-memory (OOM) errors are not hit later in training.
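One way to find this maximum is to probe batch sizes until an out-of-memory error occurs, as in the sketch below; `run_training_step` is a hypothetical stand-in for one forward/backward pass of your model, and the search bounds are illustrative.

```python
# Hedged sketch: double the batch size until the GPU runs out of memory,
# then return the last size that fit. Assumes a PyTorch training step.
import torch

def find_max_batch_size(run_training_step, start: int = 256, limit: int = 16384) -> int:
    best = 0
    batch_size = start
    while batch_size <= limit:
        try:
            run_training_step(batch_size)   # hypothetical: one forward/backward pass
            best = batch_size
            batch_size *= 2
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise
            torch.cuda.empty_cache()
            break
    # In practice, keep a reserve below this maximum (e.g. 10-20% lower)
    # to avoid OOM errors later in training.
    return best
```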
Learning rate and Warmup steps
If training diverges, try gradient clipping and/or more warmup steps.
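A minimal PyTorch sketch of gradient clipping follows; the tiny linear model and the max_norm=1.0 threshold are illustrative assumptions.

```python
# Hedged sketch: gradient clipping inside one PyTorch training step.
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 10)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
optimizer.step()
```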
If that does not help (or if the warmup steps are too high relative to the expected total training steps), try decreasing the learning rate.
Note that when you decrease warmup steps (and keep learning rate), you also increase the maximum actual learning rate.
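This follows from the inverse-square-root schedule with linear warmup commonly used for the Transformer, sketched below; assuming d_model=1024 (the BIG model), the schedule peaks at step == warmup_steps, so halving the warmup raises the peak learning rate by a factor of sqrt(2).

```python
# Hedged sketch: inverse-square-root learning-rate schedule with linear warmup.
def noam_lr(step: int, warmup_steps: int, d_model: int = 1024) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule peaks at step == warmup_steps; fewer warmup steps -> higher peak.
for warmup in (16000, 8000):
    print(f"warmup={warmup}: peak lr = {noam_lr(warmup, warmup):.6f}")
```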
Number of GPUs
For the fastest BLEU convergence, use as many GPUs as possible.
Keep the learning rate parameter at its optimal value found in single-GPU experiments.
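The sketch below only illustrates the relationship: with synchronous data parallelism, the effective batch grows with the number of GPUs while the learning-rate parameter is left at its single-GPU optimum; the per-GPU batch size and learning-rate value are illustrative assumptions.

```python
# Hedged sketch: effective batch size scales with the number of GPUs,
# while the learning-rate setting stays at the single-GPU optimum.
PER_GPU_BATCH_TOKENS = 1500   # assumed per-GPU batch size (in subword tokens)
BASE_LEARNING_RATE = 0.20     # assumed optimum found in single-GPU experiments

for num_gpus in (1, 2, 4, 8):
    effective_batch = PER_GPU_BATCH_TOKENS * num_gpus
    print(f"{num_gpus} GPU(s): effective batch = {effective_batch} tokens, "
          f"learning rate kept at {BASE_LEARNING_RATE}")
```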