Transformer: an all-attention model for the encoder-decoder framework, without any recurrence or convolution.
Attention
Self-attention (connections within a sentence)
Scaled dot-product attention (with keys, values, and queries). Self-attention learns to encode a word at a certain position by learning which other words to focus on in order to better understand it.
Done via matrix computations, which are fast to compute on GPUs (see the sketch below).
Attention: replaces the existing RNN-based encoder and decoder with the attention mechanism.
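A minimal NumPy sketch of scaled dot-product attention as a single pair of matrix products; the shapes and variable names are illustrative, not from the original notes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, computed in one matrix product.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Example: 4 tokens, d_k = d_v = 8 (dimensions chosen only for illustration).
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)  # (4, 8)
```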
Multi-head attention (connections with representations beyond the current sentence)
Multi-head attention jointly attends to information from different representation subspaces at different positions.
The decoder uses encoder-decoder attention to focus on the relevant parts of the input sequence at each of its decoding layers.
The decoder learns the alignment between source and target by using the encoder outputs as keys and values and the decoder hidden states as queries. Decoding feeds the output from the previous timestep back as input, so decoder self-attention is masked to attend only to earlier positions in the output sequence (see the sketch below).
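A PyTorch sketch of multi-head attention with an optional causal mask, covering both decoder self-attention and encoder-decoder attention. The class name, head count, and dimensions are assumptions for illustration; a real Transformer uses separate attention modules for the two cases.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q/K/V into num_heads subspaces, attend in each subspace
    independently, then concatenate the heads and project back."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, causal=False):
        # query: (batch, len_q, d_model); key/value: (batch, len_k, d_model)
        B, Lq, _ = query.shape
        Lk = key.shape[1]
        # Split the model dimension into heads: (batch, heads, len, d_head).
        q = self.q_proj(query).view(B, Lq, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(key).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(value).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if causal:
            # Mask out later positions so each query attends only to earlier
            # ones, as in decoder self-attention.
            mask = torch.triu(torch.ones(Lq, Lk, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        weights = scores.softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, Lq, -1)
        return self.out_proj(out)

# Decoder-style usage (shapes illustrative): masked self-attention, then
# encoder-decoder attention where keys/values come from the encoder outputs.
mha = MultiHeadAttention()
dec = torch.randn(2, 5, 512)   # decoder hidden states
enc = torch.randn(2, 7, 512)   # encoder outputs
self_attn = mha(dec, dec, dec, causal=True)
cross_attn = mha(dec, enc, enc)  # queries from decoder, keys/values from encoder
```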
Encoder-decoder framework
The encoder encodes the entire given sequence at once, in parallel.
The decoder uses encoder-decoder attention over the encoder outputs (see the sketch below).
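A minimal sketch of the encoder-decoder wiring using PyTorch's built-in nn.Transformer; it assumes the source and target sequences are already embedded, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# The encoder consumes the whole source sequence at once (parallel over positions);
# the decoder attends to the encoder output through encoder-decoder attention.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)  # already-embedded source sequence (batch, len, d_model)
tgt = torch.randn(2, 7, 512)   # already-embedded, right-shifted target sequence

# Causal mask so decoder self-attention only sees earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model(src, tgt, tgt_mask=tgt_mask)  # (2, 7, 512)
```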
Applications of Transformer
Vision Transformer
Uses the Transformer encoder (in particular, its self-attention layers) for visual recognition.
For inputs, images are split into fixed-size patches; each flattened patch is linearly projected and combined with learnable 1D position embeddings (see the sketch below).
Insights: much less image-specific inductive bias than CNNs, since they rely on global self-attention layers rather than local convolutions.
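A sketch of ViT-style input construction, assuming PyTorch; the class name, patch size, and dimensions are illustrative. It uses the common trick that a convolution with kernel size equal to its stride is the same as flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches, linearly project each patch, prepend a
    class token, and add learnable 1D position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # kernel = stride = patch_size: one linear projection per patch.
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)
        # Zero-initialized here only for brevity; real models initialize randomly.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, d_model) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend class token
        return x + self.pos_embed            # add 1D position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), fed to the encoder
```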
ConvNet vs. Transformer
Convolutional layers tend to generalize better and converge faster thanks to the strong prior of their inductive bias.
Attention layers have higher model capacity and can benefit from larger datasets, but this comes with a large data requirement.
ConvNet | Vision Transformer
Input-independent parameters with static values | Dynamically adapts to the input
Translation equivariant | Lacks translation equivariance (due to the absolute position embeddings)
Local receptive fields | Global receptive fields (better capture the context)
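A small PyTorch sketch contrasting the first row of the table: a convolution applies the same static kernel parameters to every input, while attention weights are recomputed from the input tokens themselves (and span the whole sequence). Module sizes are illustrative.

```python
import torch
import torch.nn as nn

# Convolution: the kernel is a fixed, input-independent parameter.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x1, x2 = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
y1, y2 = conv(x1), conv(x2)  # same conv.weight applied to both inputs

# Attention: the weights are computed from the input, so they differ per input,
# and every token can attend to every other token (global receptive field).
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
t1, t2 = torch.randn(1, 10, 16), torch.randn(1, 10, 16)
_, w1 = attn(t1, t1, t1, need_weights=True)
_, w2 = attn(t2, t2, t2, need_weights=True)
print(torch.allclose(w1, w2))  # False: dynamic, input-dependent attention weights
```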