Abstract

Recent developments in machine learning algorithms have produced outstanding results in many fields of applied computer science. The superior object detection performance of convolutional neural networks has led to a wide variety of network types and architectures. This thesis explores the use of state-of-the-art object detection networks to achieve real-time semantic annotation within a reconstructed 3D scene. An existing reconstruction framework is extended with a universal interface for different neural network types, which allows the network in use to be exchanged easily and enables fast integration of future developments. Object detection extends the geometric reconstruction towards a semantic scene understanding: the automatic annotation and segmentation of scene objects can assist the user in exploration tasks and enables interaction with scene objects.

The existing framework allows the remote live exploration of a scanned environment in virtual reality. It is based on InfiniTAM and consists of three main modules. On the server side, an environment is scanned with an RGB-D camera to generate a reconstruction of the scene. This 3D representation is transmitted to the client side, where it is triangulated into a mesh. Finally, this mesh can be explored in virtual reality using the Unreal Engine.

The RGB images of the camera stream serve as input for a convolutional neural network. The object detection results, represented as 2D bounding boxes or segmentation masks, are projected onto the 3D surface reconstruction. Fundamental changes to the processing pipeline allow the use of fully convolutional segmentation networks with long processing times while preserving the live reconstruction and streaming capabilities of the framework. An extensive filtering pipeline and a novel voting algorithm optimize the segmentation of the scene objects. Finally, annotated three-dimensional bounding boxes enclose the detected scene objects in the reconstruction, and additionally generated colliders represent their coarse geometry. This enables efficient interaction with scene objects and increases the immersion of the user.

The SSD MobileNet box detection network and the Mask R-CNN segmentation network are implemented to test the reconstruction framework against a ground truth. Each parameter of the filter pipeline is evaluated to optimize the performance of the developed framework. Numerical filters influence the overall detection rate, while visual filters determine the spatial segmentation of scene objects. The fusion of 2D bounding boxes shows a better overall result than the projection of segmentation results. Guidelines provide advice for the integration of new neural networks.
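
The universal neural network interface is only named in the abstract, not specified. The following C++ sketch shows one plausible shape for such an abstraction; every type and member name here is hypothetical, not taken from the thesis. A box detector such as SSD MobileNet would leave the mask empty, while a segmentation network such as Mask R-CNN would fill it, so the rest of the pipeline never needs to know which network produced the result.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical common detection result: a 2D box with an optional
    // per-pixel mask (empty for pure box detectors).
    struct Detection2D {
        std::string label;          // class name, e.g. "chair"
        float score;                // network confidence in [0, 1]
        int x, y, width, height;    // box in image pixel coordinates
        std::vector<uint8_t> mask;  // row-major binary mask, width*height entries
    };

    // Hypothetical universal interface: the framework talks only to this
    // base class, so networks can be exchanged without touching the pipeline.
    class ObjectDetector {
    public:
        virtual ~ObjectDetector() = default;
        // rgb: row-major 8-bit RGB image of size width x height.
        virtual std::vector<Detection2D> detect(const uint8_t* rgb,
                                                int width, int height) = 0;
    };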
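
The projection of 2D detections onto the 3D surface reconstruction relies on the depth channel and the camera pose. A minimal sketch of the standard pinhole back-projection follows, assuming known intrinsics fx, fy, cx, cy and a rigid camera-to-world pose; the numeric values in main are illustrative only, not taken from the thesis.

    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // Rigid camera-to-world transform: 3x3 rotation plus translation.
    struct Pose {
        float r[3][3];
        Vec3 t;
    };

    // Standard pinhole back-projection: lift pixel (u, v) with metric
    // depth d into camera space, then move it into world space.
    Vec3 backProject(float u, float v, float d,
                     float fx, float fy, float cx, float cy,
                     const Pose& pose) {
        Vec3 p{ (u - cx) * d / fx, (v - cy) * d / fy, d };
        return Vec3{
            pose.r[0][0] * p.x + pose.r[0][1] * p.y + pose.r[0][2] * p.z + pose.t.x,
            pose.r[1][0] * p.x + pose.r[1][1] * p.y + pose.r[1][2] * p.z + pose.t.y,
            pose.r[2][0] * p.x + pose.r[2][1] * p.y + pose.r[2][2] * p.z + pose.t.z
        };
    }

    int main() {
        Pose identity{ {{1,0,0},{0,1,0},{0,0,1}}, {0,0,0} };
        // Pixel (320, 240) at 1.5 m depth with typical VGA intrinsics.
        Vec3 w = backProject(320, 240, 1.5f, 525.0f, 525.0f, 319.5f, 239.5f, identity);
        std::printf("%.3f %.3f %.3f\n", w.x, w.y, w.z);
        return 0;
    }

Iterating this over every pixel inside a detection box, or over every set pixel of a segmentation mask, transfers the 2D label onto the reconstructed surface.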
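
The novel voting algorithm itself is not described in the abstract. As a loose illustration of the general idea only, the following hypothetical sketch lets every frame cast one label vote per covered vertex and keeps the majority label only if it reaches a minimum support, mirroring the role the numerical filters play in the detection rate.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical per-vertex vote store: vertex index -> label -> count.
    using VoteTable = std::unordered_map<int, std::unordered_map<std::string, int>>;

    // Record one observation: a frame saw this vertex under this label.
    void castVote(VoteTable& votes, int vertex, const std::string& label) {
        ++votes[vertex][label];
    }

    // Majority vote with a minimum-support threshold, so vertices seen
    // only once or twice under a label stay unannotated.
    std::string winningLabel(const VoteTable& votes, int vertex, int minVotes) {
        auto it = votes.find(vertex);
        if (it == votes.end()) return "";
        std::string best;
        int bestCount = 0;
        for (const auto& [label, count] : it->second) {
            if (count > bestCount) { best = label; bestCount = count; }
        }
        return bestCount >= minVotes ? best : "";
    }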
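
Given stably labelled vertices, the annotated three-dimensional bounding boxes can be obtained as the axis-aligned extent of each labelled vertex set, and the same box can back a coarse collider for interaction in the Unreal Engine. A minimal sketch under that assumption follows; the thesis may compute boxes and colliders differently.

    #include <vector>

    struct Vec3 { float x, y, z; };

    struct Aabb { Vec3 min, max; };

    // Axis-aligned bounding box of all vertices sharing one label. The
    // resulting box both annotates the object in the scene and can be
    // registered as a coarse box collider for interaction.
    Aabb boundingBox(const std::vector<Vec3>& labelledVertices) {
        Aabb box{ {0, 0, 0}, {0, 0, 0} };
        if (labelledVertices.empty()) return box;
        box = { labelledVertices.front(), labelledVertices.front() };
        for (const Vec3& v : labelledVertices) {
            if (v.x < box.min.x) box.min.x = v.x;
            if (v.y < box.min.y) box.min.y = v.y;
            if (v.z < box.min.z) box.min.z = v.z;
            if (v.x > box.max.x) box.max.x = v.x;
            if (v.y > box.max.y) box.max.y = v.y;
            if (v.z > box.max.z) box.max.z = v.z;
        }
        return box;
    }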

Reference

Höller, B. (2019). Smart 3D geometry understanding within a dynamic large triangulated point cloud [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2019.53448