Abstract

Videos are an integral part of today's information technologies and the web. The demand for efficient retrieval grows with the increasing number of videos, and better annotation tools are therefore needed, as today's retrieval systems rely mainly on manually generated metadata.
The situation is even more critical for user-generated videos, where rough and inaccurate annotations are common practice.
Attempts to employ content-based analysis for video annotation and retrieval already exist, but they are still in their infancy compared to the retrieval of web documents.
In this work, we address the use of object recognition techniques to annotate what is shown where in videos. These annotations are suitable for retrieving specific video scenes in response to object-related text queries, although the manual generation of such metadata would be impractical and expensive. We further present a sophisticated presentation of the retrieval results that indicates the relevance of the retrieved scenes at first glance. The proposed semi-automatic annotation approach is easy and comfortable to use, and it builds on a novel framework with the following outstanding features. First, it can be easily integrated into existing video environments. Second, it is not based on a fixed analysis chain but on an extensible recognition infrastructure that can be used with all kinds of visual features, matching techniques, and machine learning techniques. New recognition approaches can be integrated into this infrastructure at low development cost, and the recognition approaches in use can be reconfigured even on a running system. Thus, this framework can also benefit from future advances in computer vision. Third, we present an automatic selection approach that supports the use of different recognition strategies for the annotation of different objects. Moreover, visual analysis can be performed efficiently in distributed, multi-processor environments, and the resulting video annotations and low-level features can be stored in a compact form.
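
To make the infrastructure idea concrete, the following minimal Python sketch shows one way such a pluggable recognition registry could look. All class and method names here are our own illustrative assumptions, not the framework's actual API.

    from abc import ABC, abstractmethod

    class FeatureExtractor(ABC):
        """A pluggable visual feature extractor (e.g., SIFT or color histograms)."""
        @abstractmethod
        def extract(self, frame):
            ...

    class RecognitionStrategy(ABC):
        """A pluggable matching or machine-learning back end."""
        @abstractmethod
        def recognize(self, features):
            ...

    class RecognitionRegistry:
        """Holds named strategies; approaches can be registered or replaced
        at runtime, mirroring the on-the-fly configuration described above."""
        def __init__(self):
            self._strategies = {}

        def register(self, name, strategy):
            self._strategies[name] = strategy

        def annotate(self, name, frame, extractor):
            # Run the selected strategy on features extracted from one frame.
            return self._strategies[name].recognize(extractor.extract(frame))

Under this kind of design, a strategy for a new object class could be registered while the system is running, without modifying any fixed analysis chain.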
We demonstrate the proposed annotation approach in an extensive case study with promising results. A video object annotation prototype as well as the generated scene classification ground truth are freely available to foster reproducible research. Additional contributions of this work include the generation of motion-based and segmentation-based features and their use for specific annotation tasks, such as the detection of action scenes in professional and user-generated video.
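
As a rough illustration of how motion-based features can flag action scenes (a sketch of the general technique, not the dissertation's actual method), the following Python snippet marks frames with high optical-flow activity using OpenCV; the threshold and flow parameters are arbitrary assumptions chosen for illustration.

    import cv2

    def action_scene_flags(video_path, threshold=5.0):
        """Return one boolean per frame pair: True if the mean optical-flow
        magnitude exceeds the (arbitrary, illustrative) threshold."""
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        if not ok:
            return []
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        flags = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Dense optical flow between consecutive frames (Farneback's method).
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            flags.append(float(mag.mean()) > threshold)
            prev_gray = gray
        cap.release()
        return flags

Runs of consecutive True flags would then correspond to candidate action scenes; a real system would aggregate these per-frame scores over shots rather than thresholding frames in isolation.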
Furthermore, we participated in the instance search and semantic indexing tasks of the TRECVID challenge in three consecutive years: 2010, 2011, and 2012.

Reference

Sorschag, R. (2012). Intelligent video annotation and retrieval techniques [Dissertation, Technische Universität Wien]. reposiTUm. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-50967