Playing with Realistic Neural Talking Head Models
Researchers at the Samsung AI Center in Moscow recently presented interesting work, often called “living portraits”: they brought the Mona Lisa and other subjects of photos and paintings to life using videos of real people. They presented a framework for meta-learning of adversarial generative models, described in the paper “Few-Shot Adversarial Learning of Realistic Neural Talking Head Models”.
You can read more details in the original paper.
Here we review a great PyTorch implementation of the algorithm. The author of this implementation is Vincent Thévenin, a research worker at the De Vinci Innovation Center.
Starting with Realistic-Neural-Talking-Head-Models
Clone the repo and move to the project root.
Download the required files from here. You will get two files: Pytorch_VGGFACE_IR.py (PyTorch code) and Pytorch_VGGFACE.pth (PyTorch model weights).
We will not train the model from scratch, as the author provides his own pretrained weights on Google Drive.
Install the required matplotlib, opencv and face_alignment libraries:
pip install matplotlib opencv-python face_alignment
sudo apt-get install python-tk
We also need to install the NVIDIA driver required to run embedder_inference.py. Download the NVIDIA driver for your GPU from here. For instance, for a Tesla K80:
wget http://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run
Make the run file executable and install the driver:
chmod +x NVIDIA-Linux-x86_64-440.33.01.run
sudo ./NVIDIA-Linux-x86_64-440.33.01.run
In some cases this approach fails because of a running X server. An alternative way to install the NVIDIA driver is the following. Identify the recommended graphics driver for your system:
ubuntu-drivers devices
Then install the recommended NVIDIA driver and reboot:
sudo apt-get install nvidia-384
Finally, we need to install a PyTorch build that matches the installed CUDA version. Check which version of CUDA is installed:
nvcc --version
Let’s say we have CUDA v10.1
Cuda compilation tools, release 10.1, V10.1.243
Go to https://pytorch.org/ and select the appropriate command to install PyTorch:
pip install torch torchvision
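After installation it is worth checking that PyTorch can actually see the GPU; a quick sanity check:

import torch

print(torch.__version__)              # installed PyTorch version
print(torch.version.cuda)             # CUDA version this PyTorch build expects
print(torch.cuda.is_available())      # True if the driver and CUDA setup work
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. Tesla K80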
Let’s run the embedder (embedder_inference.py) on videos or images of a person to get the embedding vector:
python embedder_inference.py
Output:
Saving e_hat...
...Done saving
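The embedder writes its result to disk as ordinary PyTorch checkpoint files (named just below). If you want to sanity-check what was saved, here is a minimal sketch; the 'e_hat' key is my assumption about the save format, so check embedder_inference.py if it differs:

import torch

checkpoint = torch.load('e_hat_video.tar', map_location='cpu')  # a plain torch.save output despite the .tar extension
e_hat = checkpoint['e_hat']   # assumed key for the stored embedding; adjust to the actual format
print(e_hat.shape)            # shape of the embedding computed from the input video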
You will get two files: e_hat_images.tar and e_hat_video.tar. Let’s run finetuning_training.py:
python finetuning_training.py
Output:
What source to finetune on?
0: Video
1: Images
Enter 1 for images
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /home/vladimir/.cache/torch/checkpoints/vgg19-dcbb9e9d.pth
...
avg batch time for batch size of 1 : 0:00:12.463398
[0/40][0/1] Loss_D: 2.0553 Loss_G: 5.2649 D(x): 1.0553 D(G(y)): 1.0553
At the end you will get an image similar to this one:
As you can see, the quality is poor.
Let’s try fine-tuning on the video instead:
avg batch time for batch size of 1 : 0:00:13.738919
[0/40][0/1] Loss_D: 2.0539 Loss_G: 3.3976 D(x): 1.0539 D(G(y)): 1.0539
This time the losses are lower and the final results look better.
Result of finetuning_training on my own image
And the original
There is a script, webcam_inference.py, available for testing image generation on live video from a camera. Run webcam_inference.py:
python webcam_inference.py
This script runs the model using the person from the embedding vector and the webcam input; it performs inference only. The script shows three images: the facial landmarks, the original (me), and the fake. This time inference is done with the model fine-tuned on the video.
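Under the hood the webcam pipeline needs facial landmarks for every frame. Independent of this repo, here is a minimal sketch of grabbing a webcam frame and extracting 68 landmarks with the face_alignment library (LandmarksType._2D is the 1.x API name; newer releases rename it to TWO_D):

import cv2
import face_alignment

# 2D landmark detector on the GPU (use device='cpu' if no GPU is available)
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device='cuda')

cap = cv2.VideoCapture(0)   # default webcam
ret, frame = cap.read()     # grab a single BGR frame
cap.release()

if ret:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # face_alignment expects RGB
    landmarks = fa.get_landmarks(rgb)             # list with one (68, 2) array per detected face
    print(None if landmarks is None else landmarks[0].shape)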
Now let’s try it with the model fine-tuned on the images:
Inference is quite slow; it took a few minutes to start on a GCE VM with an NVIDIA GPU.
We can retrain the generator on videos to get better results. The author of the project stated that the generator was trained for only 5 epochs, which is not optimal.
For training you can use the VoxCeleb2 dataset. To download the dataset you should request access by filling out the form here. I have a GitHub repo with a bash script for downloading all parts of the dataset. Run the script to download the dataset:
sh download.sh
Note: each part of the dataset weighs about 30 GB (around 270 GB in total).
Once all parts are downloaded, concatenate them into a single zip archive:
cat vox2_dev* > vox2_mp4.zip
Unzip the archive, then change the path to the mp4 folder with the training videos in train.py (line 21):
path_to_mp4 = '../../Data/vox2_mp4/dev/mp4'  # location of the extracted VoxCeleb2 dev videos
dataset = VidDataSet(K=8, path_to_mp4 = path_to_mp4, device=device)  # K frames are sampled from each training video
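Before launching a long training run, it may be worth verifying that path_to_mp4 actually points at the extracted videos; a small, repo-independent check:

import os
import glob

path_to_mp4 = '../../Data/vox2_mp4/dev/mp4'   # same path as in train.py
videos = glob.glob(os.path.join(path_to_mp4, '**', '*.mp4'), recursive=True)
print('found', len(videos), 'mp4 files')       # should be a large number for VoxCeleb2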
Now run the train.py script
python train.py
This should print out
Initiating new checkpoint...
...Done
Downloading the face detection CNN. Please wait...
Downloading the Face Alignment Network(FAN). Please wait...
If you get the error
ImportError: No module named dataset.dataset_class
run the script with Python 3 (python3 train.py) and make sure all the required libraries above were installed using pip3.