Everybody Dance Now
#MotionTransfer #videotovideotranslation #GAN
Dataset: a variety of videos, enabling untrained amateurs to spin and twirl like ballerinas, perform martial arts kicks, or dance as vibrantly as pop stars
Transfer motion between two video subjects in a frame-by-frame manner > learn a mapping between images of two individuals > i.e., discover an image-to-image translation between the source and target sets
“Do as I do” motion transfer: transfer performance to a novel target < Video-to-video translation using pose as an intermediate representation
- Extract pose from [the source subject]
- Apply the learned pose-to-appearance mapping to generate [the target subject]
Predict two consecutive frames for temporally coherent video results / a separate pipeline for realistic face synthesis (overall flow sketched below)
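A structural sketch of the full pipeline, with every stage passed in as a placeholder callable; none of these names come from the paper's code:

```python
def transfer(source_frames, extract_pose, normalize_pose, generate, refine_face):
    """'Do as I do' pipeline: source video -> pose -> target-appearance video.
    Every callable here is a placeholder for a stage described above."""
    outputs = []
    for frame in source_frames:
        pose = extract_pose(frame)        # pose stick figure (intermediate representation)
        pose = normalize_pose(pose)       # match the target's body shape / frame location
        out = generate(pose)              # learned pose-to-appearance mapping
        out = refine_face(out, pose)      # separate pipeline for realistic faces
        outputs.append(out)
    return outputs
```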
Related Works
Extract motion
Problem: an exactly corresponding frame between subjects is unlikely to exist (body shape and motion style are unique to each person)
Solution: keypoint-based pose > use pose stick figures from OpenPose or DensePose
Input: pose stick figures extracted from the source, fed to the trained model
Image-to-image translation
Disentangle motion from appearance and synthesize video
- MoCoGAN: unsupervised adversarial training to learn this separation; generates videos of subjects performing novel motions or facial expressions
- Dynamics Transfer GAN (transfers facial expressions from a source image onto a video of a target person)
- Other models
- Pix2pix, CoGAN, UNIT, CycleGAN, DiscoGAN, Cascaded Refinement Networks, pix2pixHD
Method
Pose Detection
Pretrained state-of-the-art detector to create pose stick figures (Related Works - Extract motion)
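As a rough illustration of this step, a minimal sketch of rendering detected keypoints into a stick-figure image with OpenCV; the keypoint layout and limb list below are simplified assumptions, not OpenPose's actual skeleton format:

```python
import cv2
import numpy as np

# Assumed limb connectivity (a simplified subset, not OpenPose's full skeleton).
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def draw_stick_figure(keypoints, height, width, conf_thresh=0.1):
    """Render (x, y, confidence) keypoints as a pose stick-figure image."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca > conf_thresh and cb > conf_thresh:  # skip weak detections
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)),
                     (0, 255, 0), 4)
    return canvas
```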
Pose Encoding and Normalization
Accounts for differences between the source and target body shapes and locations within the frame
Encoding body poses
Use the pre-trained pose detector
Global pose normalization
Transform the source person's pose keypoints so that they appear consistent with the target person's body shape and location within the frame, as in the Transfer section
: by analyzing the heights and ankle positions of each subject's poses and using a linear mapping between the closest and farthest ankle positions
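A minimal sketch of this normalization, assuming per-video statistics (ankle-position extremes and body heights at those extremes) have already been gathered; the variable names and the exact interpolation are illustrative, not the paper's published formula:

```python
import numpy as np

def norm_params(src_ankle_y, src, tgt):
    """Per-frame scale/translation for global pose normalization (sketch).

    src / tgt are dicts with assumed keys (not the paper's code):
      'y_far', 'y_close': min/max ankle y-position seen in that video
      'h_far', 'h_close': body height measured at those two extremes
    """
    # Where this frame sits between the source's far (0) and close (1) extremes.
    alpha = (src_ankle_y - src['y_far']) / (src['y_close'] - src['y_far'])
    alpha = float(np.clip(alpha, 0.0, 1.0))

    # Linearly interpolate the target/source height ratio between the extremes.
    scale = (1 - alpha) * tgt['h_far'] / src['h_far'] + alpha * tgt['h_close'] / src['h_close']

    # Translate so the source ankle lands at the corresponding target ankle.
    tgt_ankle_y = tgt['y_far'] + alpha * (tgt['y_close'] - tgt['y_far'])
    return scale, tgt_ankle_y - scale * src_ankle_y

def apply_norm(keypoints, scale, b_y):
    """Apply y' = scale * y + b_y (x handled with the same scale, simplified)."""
    out = np.asarray(keypoints, dtype=float).copy()
    out[:, 0] *= scale
    out[:, 1] = scale * out[:, 1] + b_y
    return out
```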
Pose to Video Translation
- Simple GAN model (creates the video sequence)
: predicts two consecutive frames, enforcing temporal coherence
- training
: in short, the discriminator learns to distinguish fake frames G(x_t) from real frames y_t; the pose stick figures x come from the pre-trained pose detector P applied to the target video (a training-step sketch follows this list)
- transfer
: when a new source image comes in, the same (normalized) pose-to-frame mapping is applied
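A minimal PyTorch sketch of the temporal-coherence training step, assuming a logit-output discriminator D that judges two pose/frame pairs at once; the conditioning is simplified relative to the paper (there, G(x_t) is also conditioned on the previously generated frame):

```python
import torch
import torch.nn.functional as F

def temporal_gan_step(G, D, x_prev, x_curr, y_prev, y_curr):
    """One training step of the two-frame setup (simplified sketch).

    x_* are pose stick-figure images, y_* the matching target frames.
    Simplification: each frame is generated independently here.
    """
    g_prev, g_curr = G(x_prev), G(x_curr)

    # D must tell real tuples (x_{t-1}, x_t, y_{t-1}, y_t) from
    # fake tuples (x_{t-1}, x_t, G(x_{t-1}), G(x_t)) -> temporal coherence.
    real = torch.cat([x_prev, x_curr, y_prev, y_curr], dim=1)
    fake = torch.cat([x_prev, x_curr, g_prev.detach(), g_curr.detach()], dim=1)

    d_real, d_fake = D(real), D(fake)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator is rewarded when the fake tuple fools D.
    d_gen = D(torch.cat([x_prev, x_curr, g_prev, g_curr], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    return d_loss, g_loss
```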
- FaceGAN
: adds more detail and realism to the face region
: takes only the face region (x_F) and feeds it to a dedicated GAN; a pix2pix-style setup is used (sketched below)
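A minimal sketch of the face-refinement idea: run a dedicated generator on the face crop and add the predicted residual back. face_G and the box format are illustrative assumptions, not the authors' exact interface:

```python
import torch

def refine_face(face_G, frame, pose, box):
    """Residual face refinement (sketch) on NCHW tensors."""
    y0, y1, x0, x1 = box
    face_crop = frame[:, :, y0:y1, x0:x1]   # x_F: face region of the generated frame
    pose_crop = pose[:, :, y0:y1, x0:x1]    # matching region of the pose stick figure
    residual = face_G(torch.cat([face_crop, pose_crop], dim=1))
    refined = frame.clone()
    refined[:, :, y0:y1, x0:x1] = face_crop + residual
    return refined
```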