[AI602] 1. Vision Transformer

https://wikidocs.net/167211

  • Transformer: an all-attention model for the encoder-decoder framework, without any recurrence or convolutions.
    1. Attention
      1. Self-attention (connections within a sentence)
        • Scaled dot-product attention over queries, keys, and values. Self-attention encodes a word at a given position by learning which other words to focus on to understand it better (see the sketch after this outline).
        • Implemented as matrix multiplications, which are fast to compute on GPUs. The attention mechanism replaces the recurrence of earlier RNN-based encoders and decoders.
      2. Multi-head attention (connections to representations outside the current sentence)
        • Multi-head attention jointly attends to information from different representation subspaces at different positions.
      3. Encoder-decoder attention (connections to the input parts relevant to producing the output)
        • The decoder uses encoder-decoder attention to focus on the relevant parts of the input sequence at each of its decoding layers.
        • The decoder learns the alignment between source and target by taking its keys and values from the encoder outputs and its queries from the decoder hidden states. Since decoding feeds the output of the previous timestep back as input, the decoder's self-attention is masked so that it only attends to earlier positions in the output sequence.
    2. Encoder-decoder framework
      • The encoder encodes the entire input sequence at once, in parallel
      • The decoder uses encoder-decoder attention
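
The attention operations above can be summarized in a few lines. Below is a minimal sketch, assuming PyTorch; the function names, the head-splitting layout, and the `mask` argument are illustrative choices, not code from the lecture:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # e.g. a causal mask, so decoder self-attention only sees earlier positions
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)  # each row of attention weights sums to 1
    return weights @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, mask=None):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_model) projections
    b, n, d = x.shape
    d_k = d // num_heads
    # project, then split d_model into num_heads representation subspaces
    split = lambda t: t.view(b, n, num_heads, d_k).transpose(1, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    out = scaled_dot_product_attention(q, k, v, mask)  # (b, heads, n, d_k)
    out = out.transpose(1, 2).reshape(b, n, d)         # concatenate heads
    return out @ w_o
```

For encoder-decoder attention, the queries would come from the decoder states while the keys and values come from the encoder outputs; passing a lower-triangular `mask` gives the decoder's masked self-attention.
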
  • Applications of Transformer
    • Vision Transformer
      • Uses the Transformer encoder (in particular, self-attention) for visual recognition.
      • For inputs, the image is split into fixed-size patches; each flattened patch goes through a shared linear projection, and learnable 1D position embeddings are added (see the sketch below).
      • Insight: much less image-specific inductive bias than CNNs, since ViT relies on global self-attention layers instead of local convolutions.
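
The ViT input pipeline described above fits in a short module. A minimal sketch, again assuming PyTorch; the hyperparameters are the ViT-Base defaults (224x224 images, 16x16 patches, 768-dim embeddings) and the class name is illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a conv with stride == kernel size implements "flatten each patch
        # and apply a shared linear projection" in one step
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # learnable 1D (absolute) position embeddings
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, x):                  # x: (batch, 3, H, W)
        x = self.proj(x)                   # (batch, d_model, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (batch, num_patches, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)     # prepend the [CLS] token
        return x + self.pos_embed          # add 1D position embeddings
```

The resulting token sequence is fed to a standard Transformer encoder, and the [CLS] output is used for classification.
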
  • ConvNet vs. Transformer
    • Convolutional layers tend to generalize better and converge faster thanks to their strong inductive biases (locality, translation equivariance).
    • Attention layers have a higher model capacity and benefit more from larger datasets -> hence the large data requirement.
| ConvNet | Vision Transformer |
| --- | --- |
| Input-independent parameters with static values | Dynamically adapts to the input |
| Translation equivariant | Not translation equivariant (due to the absolute position embeddings) |
| Local receptive fields | Global receptive fields (better capture the context) |
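
The translation-equivariance row can be checked numerically. A toy check, assuming PyTorch (illustrative, not from the lecture): shifting a convolution's input shifts its output by the same amount, a property that breaks as soon as absolute position embeddings are added to the input.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)   # a toy single-channel image
w = torch.randn(1, 1, 3, 3)   # a static conv kernel, shared at every location

y = F.conv2d(x, w, padding=1)
y_shifted = F.conv2d(torch.roll(x, shifts=2, dims=3), w, padding=1)

# Away from the padded/wrapped borders, conv(shift(x)) == shift(conv(x)).
print(torch.allclose(torch.roll(y, shifts=2, dims=3)[..., 3:5],
                     y_shifted[..., 3:5]))  # True
```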
