Abstract
This cumulative thesis presents research in the field of tracking, one of the most thoroughly researched problems in computer vision. The aim of tracking is to follow an object of interest (the target) through a video. In this thesis, I focus on a special problem: tracking multiple related targets. Two important questions in tracking are "What is the target?" and "Where is the target?" The core contributions of this thesis answer these two questions with the help of graph-based representations and methods. The first core contribution is a fully automatic initialization of target models (What?), based on the principle that things which move together belong together. The input of the approach is a video showing the targets in motion. In this video, a set of salient points is tracked to extract the necessary motion information in the form of trajectories. A triangulated graph is built from the initial positions of the tracked points (i.e., their 2D positions in the first frame). The triangulated graph is then deformed according to the motion encoded in the trajectories. This deformation of the triangulation over time is the input to a hierarchical grouping process, which is realized by an irregular dual graph pyramid. In the top level of the resulting pyramid, the rigid entities (e.g., the body parts of a human body) are identified. Finally, the motion of these rigid entities is analyzed to find possible points of articulation connecting them (e.g., between the upper and lower arm of a human). The second core contribution is a novel approach for finding temporal correspondences of multiple related targets (Where?). This thesis proposes to represent the targets by a graph model, in which each target is represented by a vertex and the relationships between targets are encoded by edges. The traditional solution for finding the temporal correspondences of a graph model is graph matching. In contrast, this thesis proposes a novel approach that finds the correspondence of each vertex (target) by combining the appearance cue of a simple tracker with the structural cue deduced from the graph model. These two cues are combined in an iterative process inspired by the well-known Mean Shift algorithm. The outcome is a set of correspondences for all vertices and edges in the graph, which locally maximize the similarity in appearance and locally minimize the deviation from the structure encoded in the model. Overall, the main goal of this thesis is to show the potential of graph-based representations and methods in tracking. This goal has been achieved through these two core contributions.
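The grouping idea behind the first contribution ("things which move together belong together") can be illustrated with a small sketch. This is not the method of the thesis: it replaces the irregular dual graph pyramid with a plain union-find over edges, and it takes the edge list as given instead of computing a triangulation. The function name rigid_groups, the tolerance tol, and the toy trajectories are assumptions made purely for illustration. The sketch treats an edge as rigid when its length stays nearly constant over all frames, and merges the endpoints of rigid edges into one group.

```python
from math import dist
from statistics import mean, pstdev


def rigid_groups(trajectories, edges, tol=0.05):
    """Group tracked points connected by edges of nearly constant length.

    trajectories: one list of (x, y) positions per point, one entry per frame.
    edges: pairs of point indices (assumed to come from a triangulation).
    tol: hypothetical threshold on the relative variation of an edge length.
    """
    parent = list(range(len(trajectories)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        # Length of this edge in every frame.
        lengths = [dist(p, q) for p, q in zip(trajectories[a], trajectories[b])]
        # A small spread relative to the mean length marks the edge as rigid.
        if pstdev(lengths) <= tol * mean(lengths):
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(trajectories)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data (illustrative): points 0 and 1 translate along x,
# points 2 and 3 translate along y, over three frames.
trajectories = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 1), (2, 1)],
    [(5, 0), (5, 1), (5, 2)],
    [(6, 0), (6, 1), (6, 2)],
]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

print(rigid_groups(trajectories, edges))
```

On this toy input, only the edges (0, 1) and (2, 3) keep a constant length, so the points fall into two groups. A real pipeline would also have to cope with tracking noise and with points near an articulation, which a single global threshold handles poorly.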
Reference
Artner, N. M. (2013). Tracking related multiple targets in videos [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2013.22627