Abstract
Making large, unstructured video collections searchable in a fully automated and efficient way is a research problem whose solution would be of great interest. Many data collections, such as film archives, online media centres, video surveillance archives and online learning platforms, depend on an efficient search structure. Manually generated content indices are commonly used for this purpose. These indices are produced by storing video frames together with metadata such as keywords or textual annotations. This task is extremely time-consuming, and the resulting indices are mostly inexact and incomplete, so large amounts of information are lost for search and retrieval. This master's thesis therefore investigates the possibilities of the more recent concept of “content-based” data search, using the segmentation and classification of videos of Austrian parliament sessions as an example. The aim is the automated, multimodal extraction of audio and image features for training suitable classifiers, which are then used to classify audio events and persons. The main focus of the classification is the detection of scenes in which the atmosphere in the parliament chamber differs from the usual speech atmosphere, which would indicate interesting events during the sessions. The second major focus is the recognition of the active parliamentarians, including their facial expressions. The thesis begins with an overview of the basic principles of “content-based” video retrieval and its subareas: video segmentation, feature extraction from image and audio data, and classification. Methods for the statistical evaluation of the results are then presented, followed by an overview of related research. Afterwards, the implemented prototype is explained on the basis of the chosen features and classification methods.
Finally, the statistical evaluation of the classification results is presented. It shows that the “content-based” approach to feature extraction and classification is well suited to detecting relevant events and persons in videos of parliament sessions without the need for complex manual indexing in advance. It also shows that, for parliament session videos, audio features are more significant than visual features; focusing on the detection of audio events to identify relevant scenes has therefore proved to be the right choice. The classification of facial expressions in particular has turned out to be problematic, because in many cases the expression is not distinctive enough to be classified correctly.
Reference
Straka, K. (2018). Inhaltsbasierte Suchmaschine für Videos von Parlamentssitzungen [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.25683