Propose a neural network model that converts an audio signal from a source person into a talking-face video of a target person, with head pose and lip synchronization
A memory-augmented GAN module generates photo-realistic video frames for various face identities
After a general mapping is trained on a publicly available dataset, a short video of a new person can be used to fine-tune the mapping so that it fits that person
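
The two-stage idea in the last point (train a general mapping, then fine-tune it on a short clip of a new person) can be illustrated with a deliberately small sketch. This is not the proposed network: a one-dimensional linear map stands in for the audio-to-face mapping, and every name and number here (`sgd_fit`, the slopes, the offset, the sample counts) is a hypothetical choice made only for illustration.

```python
import random


def sgd_fit(samples, w, b, lr, epochs):
    """Fit y ~ w * x + b by full-batch gradient descent on mean squared error."""
    n = len(samples)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(samples, w, b):
    """Mean squared error of the linear map on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)

# Stage 1 (assumed stand-in for the public dataset): many samples that
# follow a shared "average speaker" relation y = 2x plus small noise.
general = [
    (x, 2.0 * x + rng.gauss(0, 0.05))
    for x in (rng.uniform(-1, 1) for _ in range(500))
]
w0, b0 = sgd_fit(general, 0.0, 0.0, lr=0.1, epochs=300)

# Stage 2 (assumed stand-in for the short video): only 10 samples from a
# new person whose relation differs slightly in slope and offset.
target = [(x, 2.3 * x + 0.5) for x in (rng.uniform(-1, 1) for _ in range(10))]

# Fine-tuning starts from the general parameters instead of from scratch.
w1, b1 = sgd_fit(target, w0, b0, lr=0.1, epochs=500)

print("error on new person before fine-tuning:", mse(target, w0, b0))
print("error on new person after fine-tuning: ", mse(target, w1, b1))
```

Starting the second stage from the general parameters is the point of the sketch: the short clip only has to correct the person-specific difference, not learn the whole mapping again.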