Abstract

This thesis proposes a novel method to estimate realistic-looking environment images from an input face image. Having correct light information is crucial for a variety of virtual and mixed reality applications, but training deep neural networks to estimate this information requires large datasets, which are not easily obtainable for pairs of face images and corresponding environment maps.

We address this problem by creating a synthetic dataset using digital human characters from the MetaHuman framework. These characters are illuminated by environment maps obtained from different sources and rendered using Unreal Engine. Through parameter augmentation, we achieve a diverse dataset of over 150,000 face images with high-quality light information.

Using this dataset, we trained a CNN to estimate the brightness of a scene given a single face image. The network is able to identify the most dominant light directions for most indoor and outdoor scenes, but sometimes fails to generate output that topologically matches the layout of equirectangular environment images. For unseen real-life examples of outdoor scenes, it was able to correctly identify the position of the sun.

To enable generating realistic-looking images from text input, we fine-tuned a pretrained diffusion network on environment images. The text prompts are generated from face images using existing image-to-text models. By adding the estimated brightness images from our CNN, we can guide the model to follow the layout of the original scenes.

Our final proposed pipeline is therefore a sequential combination of multiple neural networks, starting from a single face image. First, the brightness of the surrounding scene is estimated from the face image with a CNN. Using the same face image, a text prompt that describes the surrounding scene is generated using a pretrained image-to-text model. Then, the text prompt is fed to a fine-tuned diffusion network that is additionally conditioned on the estimated brightness image. This yields a modular system for estimating the surrounding environment from a single image of a human face.
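
As an illustration of what "dominant light direction" can mean for an equirectangular brightness image, the following minimal sketch computes a brightness-weighted mean direction over the sphere. It is not taken from the thesis: the function names, the coordinate convention (y up, row 0 at the top, image centre looking along +z), and the use of a weighted mean rather than, for example, the brightest pixel are all assumptions made for this example.

```python
import math


def pixel_to_direction(row, col, height, width):
    """Map the centre of pixel (row, col) to a unit vector on the sphere.

    Assumed convention (hypothetical, not from the thesis): y is up,
    row 0 is the top of the image, and the centre column faces +z.
    """
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi   # longitude in (-pi, pi)
    lat = (0.5 - (row + 0.5) / height) * math.pi        # latitude in (-pi/2, pi/2)
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def dominant_light_direction(brightness):
    """Return the brightness-weighted mean direction of an equirectangular map.

    `brightness` is a list of rows of non-negative floats. Each pixel is
    additionally weighted by cos(latitude), because pixels near the poles
    cover less solid angle than pixels near the horizon.
    Returns None if the weighted directions cancel out (no dominant direction).
    """
    height = len(brightness)
    width = len(brightness[0])
    sx = sy = sz = 0.0
    for r, row_values in enumerate(brightness):
        lat = (0.5 - (r + 0.5) / height) * math.pi
        solid_angle_weight = math.cos(lat)
        for c, value in enumerate(row_values):
            x, y, z = pixel_to_direction(r, c, height, width)
            w = value * solid_angle_weight
            sx += w * x
            sy += w * y
            sz += w * z
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    if norm < 1e-12:
        return None
    return (sx / norm, sy / norm, sz / norm)
```

For a map with a single bright region high in the image, such as a sun in an outdoor scene, the returned vector points upward toward that region. For maps with several comparably bright light sources, a weighted mean can point between them, so a peak-finding approach may be more appropriate in that case.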

Reference

Hochhauser, P. (2024). Deep Learning-based Light Source Estimation from Face Images [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.120596