Video object recognition based on deep learning (2019) – Research Unit Virtual & Augmented Reality

Abstract

In this master's thesis we designed a client-server system for automatic billboard recognition in video streams. The client side is an Android application that collects video streams for the server side. For the server side we designed a deep neural network called StefanNet. StefanNet is a fully convolutional neural network that classifies and localizes billboard objects within a video frame. Its feature extractor contains 23 convolutional layers, and it uses a single shot detector (SSD) as the object detector. StefanNet was trained on the self-designed BillboardDataset, which contains 4042 image samples of billboards located throughout the metro stations in Vienna. Data augmentation techniques were implemented to artificially enlarge the dataset by 25%. Furthermore, a compression-based quantization technique was applied to the StefanNet model to reduce the bit-width used for storing the network weights from float32 to float16. We evaluated the performance of StefanNet by comparing it against the state-of-the-art networks ResNet, MobileNet, Inception and VGG16. The validation dataset contains both side and frontal views of the billboards. StefanNet achieved 91% mean average precision (mAP) on the test dataset, 98% mAP on the frontal view validation dataset and 82% mAP on the side view validation dataset. The inference rate was 40 FPS on an Nvidia 1080 graphics card. The quantized version of the StefanNet model achieved 91% mAP on the test dataset, 96% mAP on the frontal view validation dataset and 85% mAP on the side view validation dataset at an inference rate of 45 FPS. Both the StefanNet model and its quantized version outperform the other evaluated benchmark networks on all datasets. This indicates that, among the evaluated architectures, StefanNet is currently the most suitable for the specific problem of automatic billboard detection in video streams.
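
The storage effect of moving weights from float32 to float16 can be illustrated with a minimal sketch. It is not the thesis implementation: the helper names `pack_weights` and `unpack_weights` and the random weight values are assumptions made purely for illustration, and only the Python standard library is used.

```python
import random
import struct


def pack_weights(values, fmt):
    """Serialize floats with a struct format code: 'f' = float32, 'e' = float16.

    Hypothetical helper, not part of StefanNet.
    """
    return struct.pack(f"<{len(values)}{fmt}", *values)


def unpack_weights(blob, fmt):
    """Read back the floats stored in a blob written by pack_weights."""
    count = len(blob) // struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{count}{fmt}", blob))


# Illustrative stand-in for a layer's weights (assumed values in [-1, 1]).
random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1000)]

blob_fp32 = pack_weights(weights, "f")  # 4 bytes per weight
blob_fp16 = pack_weights(weights, "e")  # 2 bytes per weight

# Rounding error introduced by storing the weights in half precision.
restored = unpack_weights(blob_fp16, "e")
max_abs_error = max(abs(a - b) for a, b in zip(weights, restored))

print(len(blob_fp32), len(blob_fp16), max_abs_error)
```

Under these assumptions the float16 blob takes half the bytes of the float32 blob, at the cost of a small rounding error per weight. How that error affects detection accuracy depends on the network and is what the reported mAP figures measure.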

Reference

Stojanoski, S. (2019). Video object recognition based on deep learning [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2019.55534