Abstract

Media understanding is the field concerned with giving computers a human-like perception of media objects. This thesis addresses three main topics within this area: similarity detection of human faces, object recognition (in particular, body part detection in images), and fast speech recognition. To illustrate the practical purpose and benefit of media understanding for the target group (undergraduate students in computer science), exemplary applications are implemented and discussed. The first part of this thesis elaborates on the theory behind the methods applied. Based on existing systems, the most suitable features and classifiers for a given problem statement are investigated. The second part deals with the practical implementation. The first application measures the similarity of the faces of prominent people, using template matching to calculate similarity. In the second application, different body parts recorded with a webcam are detected and classified. The software uses a local feature extraction method for this task and, among other techniques, a probabilistic method for accurate classification. The third application facilitates verbal interaction with the computer. The user is asked to imitate the sound of a given animal species, and the system then checks whether the user's imitation matches the sound of that animal. Conversely, a sound can be played to users, who are asked to reply with the correct animal name. Here, the software uses spectral audio features, which are recognized by dynamic time warping. Taken together, the three applications demonstrate that media understanding can be implemented successfully today.
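
As an illustration of the recognition step named above, the following is a minimal sketch of dynamic time warping used to pick the closest stored sound. It is not taken from the thesis: the function names (dtw_distance, classify), the Euclidean frame distance, the nearest-template decision rule, and the toy two-dimensional feature vectors are all assumptions made for illustration. Real spectral features would have more dimensions per frame.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors.

    Each sequence is a list of equally sized tuples (one tuple per audio frame).
    The Euclidean frame distance is an assumption, not the thesis's choice.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Allowed steps: advance in a only, in b only, or in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(query, templates):
    """Return the label of the template with the smallest DTW distance (hypothetical rule)."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Toy data standing in for spectral feature frames (purely illustrative)
templates = {
    "cat": [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    "dog": [(0.0, 5.0), (0.0, 6.0)],
}
# Same shape as "cat", but with the first frame held one step longer
query = [(1.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(classify(query, templates))
```

Because the warping path may repeat a frame, the stretched query still aligns with the "cat" template at zero cost, which is why this kind of matching tolerates users producing a sound faster or slower than the stored reference.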

Reference

Fried, A. (2012). Interaktive Analyse audiovisueller Medien [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-48421