Transformer: an all-attention model for the encoder-decoder framework, without any recurrence or convolution.
Attention
Self-attention (connections within a sentence)
Scaled dot-product attention (with keys, values, and queries). Self-attention learns to encode a word at a certain position by learning which other words to focus on in order to better understand it.
Done via matrix computations, which are fast to compute on GPUs (see the sketch below).
Attention: replaces the existing RNN-based encoder and decoder with the attention mechanism.
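A minimal NumPy sketch of scaled dot-product attention as a single pair of matrix products; the shapes and variable names are illustrative, not from the original notes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, computed in one matrix product.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Example: 4 tokens, d_k = d_v = 8 (dimensions chosen only for illustration).
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)  # (4, 8)
```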
Multi-head attention (connections with representations beyond the current sentence)
Multi-head attention jointly attends to information from different representation subspaces at different positions.
The decoder uses encoder-decoder attention to focus on the relevant parts of the input sequence at each of its decoding layers.
The decoder learns the alignment between source and target by using the encoder outputs as keys and values and the decoder hidden states as queries. Decoding feeds the output from the previous timestep back as input, so decoder self-attention is masked to attend only to earlier positions in the output sequence (see the sketch below).
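A PyTorch sketch of multi-head attention with an optional causal mask, covering both decoder self-attention and encoder-decoder attention. The class name, head count, and dimensions are assumptions for illustration; a real Transformer uses separate attention modules for the two cases.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q/K/V into num_heads subspaces, attend in each subspace
    independently, then concatenate the heads and project back."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, causal=False):
        # query: (batch, len_q, d_model); key/value: (batch, len_k, d_model)
        B, Lq, _ = query.shape
        Lk = key.shape[1]
        # Split the model dimension into heads: (batch, heads, len, d_head).
        q = self.q_proj(query).view(B, Lq, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(key).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(value).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if causal:
            # Mask out later positions so each query attends only to earlier
            # ones, as in decoder self-attention.
            mask = torch.triu(torch.ones(Lq, Lk, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        weights = scores.softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, Lq, -1)
        return self.out_proj(out)

# Decoder-style usage (shapes illustrative): masked self-attention, then
# encoder-decoder attention where keys/values come from the encoder outputs.
mha = MultiHeadAttention()
dec = torch.randn(2, 5, 512)   # decoder hidden states
enc = torch.randn(2, 7, 512)   # encoder outputs
self_attn = mha(dec, dec, dec, causal=True)
cross_attn = mha(dec, enc, enc)  # queries from decoder, keys/values from encoder
```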
Encoder-decoder framework
The encoder encodes the entire given sequence at once, in parallel.
The decoder uses encoder-decoder attention over the encoder outputs (see the sketch below).
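A minimal sketch of the encoder-decoder wiring using PyTorch's built-in nn.Transformer; it assumes the source and target sequences are already embedded, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# The encoder consumes the whole source sequence at once (parallel over positions);
# the decoder attends to the encoder output through encoder-decoder attention.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)  # already-embedded source sequence (batch, len, d_model)
tgt = torch.randn(2, 7, 512)   # already-embedded, right-shifted target sequence

# Causal mask so decoder self-attention only sees earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model(src, tgt, tgt_mask=tgt_mask)  # (2, 7, 512)
```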
Applications of Transformer
Vision Transformer
Uses the Transformer encoder (in particular, its self-attention layers) for visual recognition.
For inputs, images are split into fixed-size patches; each flattened patch is linearly projected and combined with learnable 1D position embeddings (see the sketch below).
Insights: much less image-specific inductive bias than CNNs, since they rely on global self-attention layers rather than local convolutions.
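A sketch of ViT-style input construction, assuming PyTorch; the class name, patch size, and dimensions are illustrative. It uses the common trick that a convolution with kernel size equal to its stride is the same as flattening each patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches, linearly project each patch, prepend a
    class token, and add learnable 1D position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # kernel = stride = patch_size: one linear projection per patch.
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)
        # Zero-initialized here only for brevity; real models initialize randomly.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, d_model) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend class token
        return x + self.pos_embed            # add 1D position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), fed to the encoder
```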
ConvNet vs. Transformer
Convolutional layers tend to generalize better and converge faster thanks to the strong prior of their inductive bias.
Attention layers have higher model capacity and can benefit from larger datasets, but this comes with a large data requirement.
ConvNet | Vision Transformer
Input-independent parameters with static values | Dynamically adapts to the input
Translation equivariant | Lacks translation equivariance (due to the absolute position embeddings)
Local receptive fields | Global receptive fields (better capture the context)
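A small PyTorch sketch contrasting the first row of the table: a convolution applies the same static kernel parameters to every input, while attention weights are recomputed from the input tokens themselves (and span the whole sequence). Module sizes are illustrative.

```python
import torch
import torch.nn as nn

# Convolution: the kernel is a fixed, input-independent parameter.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x1, x2 = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
y1, y2 = conv(x1), conv(x2)  # same conv.weight applied to both inputs

# Attention: the weights are computed from the input, so they differ per input,
# and every token can attend to every other token (global receptive field).
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
t1, t2 = torch.randn(1, 10, 16), torch.randn(1, 10, 16)
_, w1 = attn(t1, t1, t1, need_weights=True)
_, w2 = attn(t2, t2, t2, need_weights=True)
print(torch.allclose(w1, w2))  # False: dynamic, input-dependent attention weights
```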