
Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor

Wongun Choi NEC Laboratories America 10080 N. Wolfe Rd, Cupertino, CA, USA

wongun@nec-

arXiv:1504.02340v1 [cs.CV] 9 Apr 2015

Abstract

In this paper, we focus on two key aspects of the multiple target tracking problem: 1) designing an accurate affinity measure to associate detections and 2) implementing an efficient and accurate (near) online multiple target tracking algorithm. As the first contribution, we introduce a novel Aggregated Local Flow Descriptor (ALFD) that encodes the relative motion pattern between a pair of temporally distant detections using long-term interest point trajectories (IPTs). Leveraging the IPTs, the ALFD provides a robust affinity measure for estimating the likelihood of matching detections regardless of the application scenario. As another contribution, we present a Near-Online Multi-target Tracking (NOMT) algorithm. The tracking problem is formulated as a data association between targets and detections in a temporal window, which is performed repeatedly at every frame. While being efficient, NOMT achieves robustness by integrating multiple cues, including the ALFD metric, target dynamics, appearance similarity, and long-term trajectory regularization, into the model. Our ablative analysis verifies the superiority of the ALFD metric over conventional affinity metrics. We run a comprehensive experimental evaluation on two challenging tracking datasets, KITTI [15] and MOT [1]. The NOMT method combined with the ALFD metric achieves the best accuracy on both datasets with significant margins (about 10% higher MOTA) over the state of the art.

1. Introduction

The goal of multiple target tracking is to automatically identify objects of interest and reliably estimate the motion of targets over time. Thanks to recent advances in image-based object detection [9, 12, 16, 32], tracking-by-detection [3, 5, 10, 23, 25] has become a popular framework for tackling the multiple target tracking problem. The advantages of the framework are that it naturally identifies new objects of interest entering the scene, that it can handle video sequences recorded from mobile platforms, and that it is robust to target drift. The key challenge in this framework is to accurately group the detections into individual targets (data association), so that each target is fully represented by a single estimated trajectory. Mistakes in identity maintenance can result in catastrophic failures in many high-level reasoning tasks, such as future motion prediction, target behavior analysis, etc.

Figure 1. Bounding box distance and appearance similarity are popular affinity metrics in the multiple target tracking literature. However, in real-world crowded scenes, they are often too ambiguous to successfully distinguish adjacent or similar-looking targets. In contrast, optical flow trajectories provide a more reliable measure for comparing detections across time. Although an individual trajectory may be inaccurate (red line), collectively they provide strong information for measuring the affinity. We propose a novel Aggregated Local Flow Descriptor that exploits optical flow reliably in the multiple target tracking problem. The figure is best shown in color.

To implement a highly accurate multiple target tracking algorithm, it is important to have both a robust data association model and an accurate measure to compare two detections across time (a pairwise affinity measure). Recently, much work has been done on the design of data association algorithms using global (batch) tracking frameworks [3, 23, 25, 35]. Compared to their online counterparts [5, 7, 10, 20], these

Figure 2. Our NOMT algorithm solves the global association problem at every time frame t within a temporal window τ. Solid circles represent associated targets, dashed circles represent unobserved detections, dashed lines show finalized target associations before the temporal window, and solid lines represent the (active) associations made in the current time frame. Due to the limited amount of observation, the tracking algorithm may produce an erroneous association at t2. But once more observations are provided at t3, our algorithm is capable of fixing the error made at t2. In addition, our method automatically identifies new targets on the fly (red circles). The figure is best shown in color.

methods have the benefit of considering all the detections over the entire time span. With the help of clever optimization algorithms, they achieve higher data association accuracy than traditional online tracking frameworks. However, the application of these methods is fundamentally limited to post-analysis of video sequences. On the other hand, the pairwise affinity measure has been relatively under-investigated in the recent literature despite its importance. Most methods adopt weak affinity measures (see Fig. 1) to compare two detections across time, such as spatial affinity (e.g. bounding box overlap or Euclidean distance [2, 3, 28]) or simple appearance similarity (e.g. an intersection kernel with color histograms [29]). In this paper, we address two key challenges of the multiple target tracking problem: 1) how to accurately measure the pairwise affinity between two detections (i.e. the likelihood that the two should be linked) and 2) how to efficiently bring the ideas of global tracking algorithms into an online application.

As the first contribution, we present a novel Aggregated Local Flow Descriptor (ALFD) that encodes the relative motion pattern between two detection boxes in different time frames (Sec. 3). By aggregating multiple local interest point trajectories (IPTs), the descriptor encodes how the IPTs in one detection move with respect to another detection box, and vice versa. The main intuition is that although each individual IPT may have errors, collectively they provide strong information for comparing two detections. With a learned model, we observe that the ALFD provides a strong affinity measure, thereby providing strong cues for the association algorithm.

As the second contribution, we propose an efficient Near-Online Multi-target Tracking (NOMT) algorithm. Incorporating the robust ALFD descriptor as well as longterm motion/appearance models motivated by the success of modern batch tracking methods, the algorithm produces

highly accurate trajectories, while preserving the causality property and running in near real-time (∼10 FPS). In every time frame t, the algorithm solves the global data association problem between targets and all the detections in a temporal window [t-τ, t] of size τ (see Fig. 2). The key property is that the algorithm is able to fix any association errors made in the past when more detections are provided. In order to achieve both accuracy and efficiency, the algorithm generates candidate hypothetical trajectories using ALFD-driven tracklets and solves the association problem with a parallelized junction tree algorithm (Sec. 4).

We perform a comprehensive experimental evaluation on two challenging datasets: the KITTI [15] and MOT Challenge [1] datasets. The proposed algorithm achieves the best accuracy with a large margin over the state of the art (including batch algorithms) on both datasets, demonstrating the superiority of our algorithm. The rest of the paper is organized as follows. Sec. 2 discusses the background and related work in the multiple target tracking literature. Sec. 3 describes our newly proposed ALFD. Sec. 4 presents an overview of the NOMT data association model and algorithm. Sec. 5 discusses the details of the model design. We show the analysis and experimental evaluation in Sec. 6, and finally conclude with Sec. 7.

2. Background

Given a video sequence V1:T = {I1, I2, ..., IT} of length T and a set of detection hypotheses D1:T = {d1, d2, ..., dN}, where di is parameterized by the frame number ti, a bounding box (di[x], di[y], di[w], di[h])¹, and a score si, the goal of multiple target tracking is to find a coherent set of targets (associations) A = {A1, A2, ..., AM}, where each target Am is parameterized by a set of detection indices (e.g. A1 = {d1, d10, d23}) during its time of presence; i.e. (V1:T, D1:T) → A.

¹ The [x], [y], [w], [h] operators represent the x, y, width, and height values, respectively.
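To make the notation above concrete, the detection and target structures can be sketched as plain records. This is our own illustrative encoding (field names are not from the paper): a detection carries its frame, box, and score, and a target is just a set of detection indices.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    # d_i: frame number t_i, bounding box (x, y, w, h), and score s_i
    t: int
    x: float
    y: float
    w: float
    h: float
    score: float

@dataclass
class Target:
    # A_m: a set of detection indices during the target's time of presence
    detection_ids: set = field(default_factory=set)

# e.g. A1 = {d1, d10, d23}
A1 = Target({1, 10, 23})
d5 = Detection(t=5, x=12.0, y=34.0, w=40.0, h=80.0, score=0.9)
```

The association output A is then simply a list of such `Target` records, one per identity.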

2.1. Data Association Models

Most multiple target tracking algorithms/systems can be classified into two categories: online methods and global (batch) methods.

Online algorithms [5, 7, 10, 20, 27] are formulated to find the association between existing targets and the detections in the current time frame: (V1:t, Dt:t, A(t-1)) → A(t). The advantages of the online formulation are that: 1) it is applicable to online/real-time scenarios and 2) it can take advantage of the targets' dynamics information available in A(t-1). Such methods, however, are often prone to association errors since they consider only one frame when making the association. Solving the problem based on (temporally) local information can fundamentally limit the association accuracy. To avoid such errors, [5] adopts a conservative association threshold together with detection confidence maps, and [7, 20, 27] model interactions between targets.

Recently, global algorithms [2, 3, 25, 28, 35] have become popular in the community, as more robust association is achieved when long-term information is considered in the association process. One common approach is to formulate tracking as a network flow problem to directly obtain the targets from the detection hypotheses [3, 28, 35]; i.e. (V1:T, D1:T) → A(T). Although they have shown promising accuracy in multiple target tracking, these methods are often over-simplified for tractability: they ignore useful target-level information, such as target dynamics and interactions between targets (occlusion, social interaction, etc.). Instead of directly solving the problem in one step, others employ an iterative algorithm that progressively refines the target association [2, 18, 23, 25]; i.e. (V1:T, D1:T, Ai(T)) → Ai+1(T), where i represents the iteration. Starting from short trajectories (tracklets), [18, 23] associate them into longer targets in a hierarchical fashion. [2, 25] iterate between two modes, association and continuous estimation. Since these methods obtain intermediate target information, the targets' dynamics, interactions, and high-order statistics of the trajectories can be accounted for, which can lead to better association accuracy. However, it is unclear how to seamlessly extend such models to an online application.

We propose a novel framework that fills the gap between online and global algorithms. The task is defined as solving the following problem in each time frame t: (V1:t, D(t-τ):t, A(t-1)) → A(t), where τ is a pre-defined temporal window size. Our algorithm behaves similarly to online algorithms in that it outputs an association in every time frame. The critical difference is that any decision made in the past is subject to change once more observations are available. The association problem in each temporal window is solved using a newly proposed global association algorithm. Our method is also reminiscent of iterative global algorithms, since we augment all the tracks iteratively (one iteration per frame) considering multiple frames, which leads to better association accuracy.
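The per-frame, sliding-window scheme (V1:t, D(t-τ):t, A(t-1)) → A(t) can be sketched as a simple loop. This is a minimal sketch under our own naming; the `associate` callback stands in for the global association of Sec. 4, and `toy_associate` is a hypothetical stand-in used only to show the calling convention.

```python
def near_online_track(detections_per_frame, associate, tau=10):
    """Near-online tracking: at each frame t, re-solve the association over
    the window [t - tau, t]. Assignments inside the window may be revised
    (cf. Fig. 2); assignments before the window are final."""
    targets = {}  # target id -> sorted list of (frame, detection index)
    for t in range(len(detections_per_frame)):
        lo = max(0, t - tau)
        window = {f: detections_per_frame[f] for f in range(lo, t + 1)}
        # associate() may reassign any detection inside the window
        targets = associate(targets, window, lo)
    return targets

# Hypothetical stand-in for the real association step: dump every detection
# in the window into a single target, keeping earlier assignments intact.
def toy_associate(targets, window, lo):
    assigned = set(targets.get(0, []))
    for f, dets in window.items():
        for i in range(len(dets)):
            assigned.add((f, i))
    return {0: sorted(assigned)}
```

With one detection per frame, `near_online_track([["a"], ["b"], ["c"]], toy_associate, tau=1)` grows target 0 frame by frame while keeping the window open for revision.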

2.2. Affinity Measures in Visual Tracking

The importance of a robust pairwise affinity measure (i.e. the likelihood of di and dj being the same target) has been relatively under-investigated in the multi-target tracking literature. Most recent work [2, 3, 28, 29] employs a spatial distance and/or an appearance similarity with simple features (such as color histograms). In order to learn a discriminative affinity metric, Kuo et al. [23] introduce online appearance learning with a boosting algorithm using various feature inputs such as HoG [8], texture features, and RGB color histograms. Milan et al. [25] and Zamir et al. [29] propose a global appearance consistency measure to ensure that a target has a similar (or smoothly varying) appearance over the long term. Although there is much work exploiting appearance information or spatial smoothness, we are not aware of any work employing optical flow trajectories to define the likelihood of matching detections. Recently, Fragkiadaki et al. [13] introduced a method to track multiple targets while jointly clustering optical flow trajectories. The work presents promising results, but the model is complicated by the joint inference over both target-level and flow-level association. In contrast, our ALFD provides a strong pairwise affinity measure that is generally applicable in any tracking model.

3. Aggregated Local Flow Descriptor

The Aggregated Local Flow Descriptor (ALFD) encodes the relative motion pattern between two bounding boxes at a temporal distance Δt = |ti - tj|, given interest point trajectories [31]. The main intuition behind the ALFD is that if the two boxes belong to the same target, we shall observe many supporting IPTs in the same relative location with respect to the boxes. In order to make it robust against small localization errors in detections, changes in the targets' orientation, and outliers/errors in the IPTs, we build the ALFD using spatial histograms. Once the ALFD is obtained, we measure the affinity between two detections using the linear product of a learned model parameter wΔt and the ALFD, i.e. aA(di, dj) = wΔt · Φ(di, dj). In the following subsections, we discuss the details of the design.

3.1. Interest Point Trajectories

We obtain Interest Point Trajectories using a local interest point detector [4, 30] and an optical flow algorithm [4, 11]. The algorithm is designed to produce a set of long and accurate point trajectories by combining various well-known computer vision techniques. Given an image It, we run the FAST interest point detector [4, 30] to identify "good points" to track. In order to avoid redundant points, we compute the distance between the newly detected interest points and the existing IPTs, and keep only the new points sufficiently far from the existing IPTs (> 4 px). The new points are assigned unique IDs. For all the IPTs at t, we compute the forward (t → t+1) and backward (t+1 → t) optical flow using [4, 11]. The starting points of the backward flows are given by the forward flows' end points. Any IPT with a large disagreement between the two (> 10 px) is terminated.

Figure 3. Illustrative figure for the unidirectional ALFD ρ(di, dj). In the top figure, we show detections as colored bounding boxes (dred, dblue, and dgreen). A pair of circles with a connecting line represents an IPT existing at both t and t + Δt and located inside dred at t. We draw the accurate (green), outlier (black), and erroneous (red) IPTs. In the bottom figure, we show two exemplar unidirectional ALFDs, ρ(dred, dblue) and ρ(dred, dgreen). The red grids (2 × 2) represent the IPTs' locations at t relative to dred. The blue and green grids inside each red bin (2 × 2 + 2 external bins) show the IPTs' locations at t + Δt relative to the corresponding boxes. IPTs in the grid bins marked with a red box are those observed in the same relative location. Intuitively, the more IPTs are observed in these bins, the more likely the two detections belong to the same target. In contrast, wrong matches have more support in the outside bins. The illustration uses 2 × 2 grids to avoid clutter; we use 4 × 4 in practice. The figure is best shown in color.

Figure 4. Visualization of two learned model weights, w1 and w20. Having a higher value in the bright (white) bins yields a higher affinity measure. As the temporal distance increases, the model weights tend to spread out to the adjacent bins to account for possible changes in the targets' orientation and higher IPT errors.
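The IPT bookkeeping of Sec. 3.1 (proximity suppression at > 4 px, forward-backward consistency at > 10 px) can be sketched in a few lines of numpy. This is a minimal sketch under our own naming; the actual interest point detection and flow computation are assumed to come from external routines (e.g. a FAST detector and an optical flow solver), so only the filtering logic is shown.

```python
import numpy as np

def suppress_new_points(new_pts, existing_pts, min_dist=4.0):
    """Keep only newly detected interest points that are at least
    min_dist pixels away from every existing IPT location."""
    if len(existing_pts) == 0:
        return new_pts
    # pairwise distances: (num_new, num_existing)
    d = np.linalg.norm(new_pts[:, None, :] - existing_pts[None, :, :], axis=2)
    return new_pts[d.min(axis=1) > min_dist]

def forward_backward_check(pts_t, bwd_end, max_err=10.0):
    """The backward flow starts at the forward flow's end point; an IPT
    survives only if the backward flow lands within max_err pixels of the
    original location. Returns a boolean mask of surviving IPTs."""
    err = np.linalg.norm(pts_t - bwd_end, axis=1)
    return err <= max_err
```

Terminated IPTs are dropped and the surviving ones extended by their forward-flow endpoints before the next frame is processed.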

3.2. ALFD Design

Let us define the necessary notation for the ALFD. κid ∈ K represents an IPT with a unique id, parameterized by pixel locations (κid(t)[x], κid(t)[y]) during its time of presence. κid(t) denotes the pixel location at frame t. If κid does not exist at t (terminated or not yet initiated), ∅ is returned.

We first define a unidirectional ALFD ρ(di, dj), i.e. from di to dj, by aggregating the information from all the IPTs that are located inside the box di and exist at tj. Formally, we define the IPT set as K(di, dj) = {id | κid(ti) ∈ di & κid(tj) ≠ ∅}. For each id ∈ K(di, dj), we compute the relative location ri(id) = (x, y) of κid at ti by ri(id)[x] = (κid(ti)[x] - di[x])/di[w] and ri(id)[y] = (κid(ti)[y] - di[y])/di[h]. We compute rj(id) similarly. Notice that ri(id) is bounded in [0, 1], but rj(id) is not bounded, since κid can be outside of dj. Given ri(id) and rj(id), we compute the corresponding spatial grid bin indices as shown in Fig. 3 and accumulate the counts to build the descriptor. We define 4 × 4 grids for ri(id) and 4 × 4 + 2 grids for rj(id), where the last 2 bins account for the region outside the detection. The first outside bin defines the neighborhood of the detection (< width/4 & < height/4), and the second outside bin represents any farther region.
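The relative-location and bin-index computation can be sketched as follows. This is a minimal sketch under our reading of the layout (n × n inside bins, one "neighborhood" bin within a quarter of the box size outside, and one far-away bin); the function names are our own.

```python
def relative_location(pt, box):
    # box = (x, y, w, h); result lies in [0, 1] when pt is inside the box
    x, y, w, h = box
    return ((pt[0] - x) / w, (pt[1] - y) / h)

def grid_bin(r, n=4):
    """Map a relative location r = (rx, ry) to one of n*n inside bins,
    bin n*n (the neighborhood, < 1/4 of the box size outside),
    or bin n*n + 1 (any farther region)."""
    rx, ry = r
    if 0.0 <= rx < 1.0 and 0.0 <= ry < 1.0:
        return int(rx * n) + n * int(ry * n)   # inside grid, row-major
    if -0.25 <= rx < 1.25 and -0.25 <= ry < 1.25:
        return n * n                           # neighborhood bin
    return n * n + 1                           # far-away bin
```

A unidirectional descriptor ρ(di, dj) is then a joint histogram over (grid_bin(ri), grid_bin(rj)) pairs, i.e. 16 × 18 counts for the 4 × 4 setting.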

Using a pair of unidirectional ALFDs, we define the full ALFD as Φ(di, dj) = (ρ(di, dj) + ρ(dj, di)) / n(di, dj), where n(di, dj) is a normalizer. The normalizer is defined as n(di, dj) = |K(di, dj)| + |K(dj, di)| + λ, where |K(·)| is the count of IPTs and λ is a constant. λ ensures that the L1 norm of the ALFD increases as we have more supporting IPTs and converges to 1. We use λ = 20 in practice.
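The symmetric aggregation and normalization reduce to a few array operations. This is a minimal sketch with our own names: the `rho_*` arguments are assumed to be raw count histograms and `n_*` the IPT set sizes |K(·)|.

```python
import numpy as np

LAMBDA = 20.0  # the constant lambda in the normalizer; keeps ||Phi||_1 < 1

def alfd(rho_ij, rho_ji, n_ij, n_ji):
    """Phi(d_i, d_j) = (rho(d_i, d_j) + rho(d_j, d_i)) / n(d_i, d_j),
    where n = |K(d_i, d_j)| + |K(d_j, d_i)| + lambda."""
    n = n_ij + n_ji + LAMBDA
    return (rho_ij + rho_ji) / n

def affinity(w, phi):
    # a_A(d_i, d_j) = w_dt . Phi(d_i, d_j), bounded in [-1, 1]
    return float(np.dot(w.ravel(), phi.ravel()))
```

Since the raw counts in the numerator sum to n_ij + n_ji, the L1 norm of Φ is (n_ij + n_ji) / (n_ij + n_ji + λ): small for weakly supported pairs, approaching 1 as the IPT support grows.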

3.3. Learning the Model Weights

We learn the model parameters wΔt from a training dataset by weighted voting. Given a set of detections D1:T and corresponding ground truth (GT) target annotations, we first assign a GT target id to each detection. For each detection di, we measure the overlap with all the GT boxes in frame ti. If the best overlap oi is larger than 0.5, the corresponding target id (idi) is assigned; otherwise, -1 is assigned. For each detection with idi ≥ 0 (positive detections), we collect a set of detections PiΔt = {dj ∈ D1:T | tj - ti = Δt}. For each pair, we compute the margin mij as follows: if idi and idj are identical, mij = (oi - 0.5) · (oj - 0.5); otherwise, mij = -(oi - 0.5) · (oj - 0.5). Intuitively, mij has a positive value if the two detections are from the same target and a negative value if di and dj are from different targets, with the magnitude weighted by the localization accuracy. Given all the pairs and margins, we learn the model wΔt as follows:

wΔt = [ Σ_{i ∈ D1:T, idi ≥ 0} Σ_{j ∈ PiΔt} mij (ρ(di, dj) + ρ(dj, di)) ] / [ Σ_{i ∈ D1:T, idi ≥ 0} Σ_{j ∈ PiΔt} |mij| (ρ(di, dj) + ρ(dj, di)) ]    (1)

where the division is element-wise. The algorithm computes a weighted average with a sign over all the ALFD patterns, where the weights are determined by the overlap between targets and detections. Intuitively, ALFD patterns between detections that match well with the GT contribute more to the model parameters. The advantage of the weighted voting method is that each element of wΔt is bounded in [-1, 1]; thus the ALFD metric aA(di, dj) is also bounded in [-1, 1], since ||Φ(di, dj)||1 ≤ 1. Fig. 4 shows two models learned with our method. One can adopt alternative learning algorithms such as an SVM [6].
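Eq. (1) is an element-wise weighted vote; a minimal sketch, assuming the descriptors are given as flat count vectors and the margins are computed as in the text:

```python
import numpy as np

def learn_weights(pairs):
    """pairs: list of (m_ij, rho_sum) where m_ij is the signed margin
    (o_i - 0.5)(o_j - 0.5) (negative for different-target pairs) and
    rho_sum = rho(d_i, d_j) + rho(d_j, d_i) is the raw count histogram.
    Returns w_dt with each element bounded in [-1, 1]."""
    num = sum(m * phi for m, phi in pairs)
    den = sum(abs(m) * phi for m, phi in pairs)
    # element-wise division; bins that received no votes get weight 0
    return np.divide(num, den, out=np.zeros_like(den), where=den > 0)
```

A bin voted on only by same-target pairs gets weight +1, only by different-target pairs -1, and mixed evidence lands in between, which matches the boundedness argument above.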

3.4. Properties

In this section, we discuss the properties of the ALFD affinity metric aA(di, dj). Firstly, unlike appearance or spatial metrics, the ALFD implicitly exploits the information in all the images between ti and tj through the IPTs. Secondly, thanks to the collective nature of the ALFD design, it provides a strong affinity metric over an arbitrary length of time. We observe a significant benefit over appearance or spatial metrics, especially over a long temporal distance (see Sec. 6.1 for the analysis). Thirdly, it is generally applicable to any scenario (either static or moving camera) and any object type (person or car). A disadvantage of the ALFD is that it may become unreliable when there is an occlusion: when an occlusion happens to a target, the IPTs initiated from the target tend to adhere to the occluder. This motivates us to combine it with the target dynamics information discussed in Sec. 5.1.

4. Near Online Multi-target Tracking (NOMT)

We employ a near-online multi-target tracking framework that updates and outputs targets A(t) in each time frame, considering the inputs in a temporal window [t-τ, t]. We implement the NOMT algorithm with a hypothesis generation and selection scheme. For convenience of discussion, we define clean targets A(t-1) = {A1(t-1), A2(t-1), ...} that exclude all the detections associated in [t-τ, t-1]. Given the set of detections in [t-τ, t] and the clean targets A(t-1), we generate multiple target hypotheses Hm(t) = {Hm,1(t) = ∅, Hm,2(t), Hm,3(t), ...} for each target Am(t-1) as well as for newly entering targets, where ∅ (the empty hypothesis) represents the termination of the target and each Hm,k(t) indicates a set of candidate detections in [t-τ, t] that can be associated to the target (Sec. 4.2). Each Hm,k(t) may contain 0 to τ detections (at any one time frame, there can be 0 or 1 detection). Given the set of hypotheses for all the existing and new targets, the algorithm finds the most consistent set of hypotheses (MAP), one for each target, using a graphical model (Sec. 4.3). As the key characteristic, our algorithm is able to fix any association error (for the detections within the temporal window [t-τ, t]) made in previous time frames.

4.1. Model Representation

Before going into the details of each step, we discuss our underlying model representation. The model is formulated in an energy minimization framework: x̂ = argmin_x E(A(t-1), H(t)(x), D(t-τ):t, V1:t), where x is an integer state vector indicating which hypothesis is chosen for each target, H(t) is the set of all the hypotheses {H1(t), H2(t), ...}, and H(t)(x) is the set of selected hypotheses {H1,x1(t), H2,x2(t), ...}. Solving the optimization, the updated targets A(t) are uniquely identified by augmenting A(t-1) with the selected hypotheses H(t)(x̂). Hereafter, we drop V1:t and D(t-τ):t to avoid clutter in the equations. The energy is defined as follows:

E(A(t-1), H(t)(x)) = Σ_{m ∈ A(t-1)} Ψ(Am(t-1), Hm,xm(t)) + Σ_{(m,l) ∈ A(t-1)} Λ(Hm,xm(t), Hl,xl(t))    (2)

where Ψ(·) encodes the individual target's motion, appearance, and ALFD metric consistency, and Λ(·) represents an exclusive relationship between different targets (e.g. no two targets may share the same detection). If there are hypotheses for newly entering targets, we define the corresponding target as an empty set, Am(t-1) = ∅.
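A minimal sketch of evaluating the energy of Eq. (2) for a candidate state x, with the pairwise term implemented as a hard exclusion (infinite energy when two selected hypotheses share a detection); the data layout and names are our own illustration, not the paper's implementation:

```python
import math

def energy(unary, hypotheses, x):
    """unary[m][k]: single-target consistency cost for hypothesis k of
    target m (lower is better), standing in for the Psi term.
    hypotheses[m][k]: set of detection ids used by that hypothesis.
    x[m]: chosen hypothesis index for target m."""
    e = 0.0
    for m, k in enumerate(x):
        e += unary[m][k]
    # exclusion term: no two targets may share a detection
    for m in range(len(x)):
        for l in range(m + 1, len(x)):
            if hypotheses[m][x[m]] & hypotheses[l][x[l]]:
                return math.inf
    return e

# exhaustive minimization over all states, for illustration only; the paper
# uses a parallelized junction tree algorithm instead (Sec. 4)
def best_state(unary, hypotheses):
    from itertools import product
    states = product(*(range(len(u)) for u in unary))
    return min(states, key=lambda x: energy(unary, hypotheses, x))
```

The junction tree inference of Sec. 4 exploits the sparsity of the exclusion term, so only overlapping hypotheses ever interact.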

Single Target Consistency

The potential Ψ measures the compatibility of a hypothesis Hm,xm(t) with a target Am(t-1). Mathematically, it decomposes into unary, pairwise, and high-order terms as follows:

Ψ(Am(t-1), Hm,xm(t)) = Σ_{i ∈ Hm,xm(t)} ψu(Am(t-1), di) + Σ_{(i,j) ∈ Hm,xm(t)} ψp(di, dj) + ψh(Am(t-1), Hm,xm(t))    (3)
