Abstract
This master’s thesis investigates weakly supervised visual object detection in a given set of images. The main goal is to obtain an optimal object model for any selected visual object by learning from images labeled as positive or negative. To this end, an analysis process is proposed that gathers segments of the target visual object from positive training images. A more general object model is then built by determining the most discriminative of the detected object segments. This final form of the object model is employed by a binary classifier to detect segments of the target visual object in test images. The proposed approach for recovering an optimal object model comprises four major processing steps: segmentation, feature extraction, similarity measurement, and learning. For each of these steps, the suitability of different techniques is evaluated. First, mean shift segmentation is compared with a simpler and faster sliding-window approach. Second, different types of features (color histograms, dense SIFT descriptors, PHOW descriptors, VLAD descriptors, CEDD, and MPEG-7 color descriptors) are evaluated for describing the segments obtained in the first step. Third, different distance and similarity functions are evaluated for similarity measurement. Lastly, a non-parametric discriminative learning scheme based on information gain is employed in the learning step. The result of learning is a ranking that expresses the distinctiveness of each candidate segment. In addition, MPEG video encoding is investigated as an alternative technique for both feature extraction and similarity measurement. For this purpose, an approach originating from texture classification is extended to color image segments.
The experimental results demonstrate lower computational complexity for all combinations of the investigated feature descriptors and distance functions than for the MPEG video encoding-based approach. Furthermore, the accuracy of the visual object models obtained by MPEG video encoding is lower than that of some of the proposed approaches that employ separate processing steps for feature extraction and similarity measurement. The results of this thesis suggest using VLAD descriptors together with one of three distance functions: the chi-square statistic, the diffusion distance, or the Euclidean distance. Moreover, target object detectors are evaluated by selecting one of the recovered object models for each evaluation. The target object detector is either an SVM classifier, a Naïve Bayes classifier, or an alternative approach that employs information gain to learn decision thresholds.
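As an illustration only, and not the implementation used in the thesis, the following Python sketch shows how two of the distance functions named above could be computed on histogram-like descriptors, and how information gain could be used to pick a decision threshold on distances to a candidate segment. The function names, the 0.5 scaling of the chi-square statistic, the small `eps` term, and the toy data are all assumptions made for this example.

```python
import math


def chi_square_distance(p, q, eps=1e-12):
    """Chi-square statistic between two non-negative descriptors.

    Assumption: the common 0.5 * sum((a - b)^2 / (a + b)) variant;
    eps only guards against division by zero for empty bins.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def euclidean_distance(p, q):
    """Euclidean (L2) distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 image labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    pos = sum(labels) / n
    return -sum(x * math.log2(x) for x in (pos, 1.0 - pos) if x > 0)


def information_gain(distances, labels, threshold):
    """Entropy reduction obtained by splitting the images at `threshold`."""
    near = [y for d, y in zip(distances, labels) if d <= threshold]
    far = [y for d, y in zip(distances, labels) if d > threshold]
    n = len(labels)
    split = (len(near) / n) * entropy(near) + (len(far) / n) * entropy(far)
    return entropy(labels) - split


def best_threshold(distances, labels):
    """Return (gain, threshold) for the observed distance with highest gain."""
    candidates = sorted(set(distances))
    return max((information_gain(distances, labels, t), t) for t in candidates)


# Hypothetical toy data: distance from one candidate segment to the closest
# segment of each training image, with labels 1 = positive, 0 = negative.
distances = [0.1, 0.2, 0.8, 0.9]
labels = [1, 1, 0, 0]
gain, threshold = best_threshold(distances, labels)
```

Under these assumptions, the gain returned for each candidate segment could serve as a score for ranking candidates by distinctiveness, and the associated threshold as a simple decision rule; the thesis itself should be consulted for the exact formulation.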
Reference
Ince, S. A. (2017). Weakly-supervised learning of visual object models [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.51247