JDE: Towards Real-Time Multi-Object Tracking (English Reading Notes)
SDE methods bring critical challenges in building a real-time MOT system
Faster R-CNN = Fast R-CNN + RPN
Separate Detection and Embedding (SDE)
Detector -> Cropped Image -> ReID model -> reid feature
RPN -> Detection -(sharing feature map)-> reid embedding
Joint Detection and Embedding
\[\mathbf{F}_{i} = \text{Head}(\text{FPN}_{i}) \]
Design/Training
- Anchor
- modified from original
- adapted for MOT task
- all anchors are set to an aspect ratio of 1 : 3.
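A minimal sketch of how anchor templates with one fixed aspect ratio could be generated. It assumes that 1 : 3 means width : height (tall boxes, which suits pedestrians). The widths below are hypothetical placeholders chosen for illustration, not the values used in the paper, and `make_anchor_templates` is an illustrative helper name.

```python
def make_anchor_templates(widths, ratio_w=1.0, ratio_h=3.0):
    """Return (width, height) templates that all share ratio_w : ratio_h."""
    return [(w, w * ratio_h / ratio_w) for w in widths]


# Hypothetical anchor widths in pixels (assumption, for illustration only).
templates = make_anchor_templates([16, 32, 64, 128])

for w, h in templates:
    print(f"anchor: {w} x {h:.0f}")
```

Because every template is derived from its width alone, the only free design choice left is the set of scales assigned to each FPN level.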
Contrastive Learning
The margin term is neglected for convenience.
Within a mini-batch, mine all the negative samples \(f^{-}_{i}\) and the hardest positive sample \(f^{+}\).
\(f^{T}\) is the selected anchor in the batch.
this is the upper bound of triplet loss
this is the cross-entropy loss \(\mathcal{L} = -\sum_{c=1}^{\text{Cls.}}\mathbb{I}(y_{i} = c)\log p(f(x) = c)\) with \(p = \text{Softmax}(g^{+},\{g^{-}\})\)
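The link between the upper bound and the cross-entropy loss can be checked numerically. The sketch below assumes the upper bound has the form \(\log\bigl(1 + \sum_{i}\exp(f^{T}f^{-}_{i} - f^{T}f^{+})\bigr)\) and treats the positive sample as the target class of a softmax over the similarities. For simplicity, the sample embeddings stand in for the class-wise weights \(g^{+}\) and \(g^{-}\); the toy vectors and function names are illustrative assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def upper_bound_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(f^T f_i^- - f^T f^+)), margin neglected."""
    pos = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, n) - pos) for n in negatives))


def softmax_ce_loss(anchor, positive, negatives):
    """Negative log softmax probability of the positive among all candidates."""
    logits = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[0]


# Toy 2-D embeddings (assumption, for illustration only).
anchor = [0.6, 0.8]
positive = [0.8, 0.6]
negatives = [[-0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]

l_upper = upper_bound_loss(anchor, positive, negatives)
l_ce = softmax_ce_loss(anchor, positive, negatives)
print(l_upper, l_ce)
```

The two values agree up to floating-point error, since dividing the softmax numerator and denominator by \(\exp(f^{T}f^{+})\) yields exactly the log-sum form.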
Multi-task training
\(M\) is the number of prediction heads.
+ Question: each feature map at a different FPN scale is trained.
+ But during inference, which feature map should be used?
+ Or rather, should we design a strategy to
+ further fuse the predictions at different scales?
Ref: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
We employ task-dependent uncertainty [16] to dynamically weight the heterogeneous losses.
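A minimal sketch of uncertainty-based loss weighting, assuming the commonly used form \(\tfrac{1}{2}\bigl(e^{-s}\,\mathcal{L} + s\bigr)\) summed over all \(M\) heads and all tasks, where \(s\) is a learnable log-variance per head and task. The task names `cls`, `reg`, `emb` and the loss values are hypothetical; in a real model \(s\) would be a trainable parameter rather than a fixed number.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Sum 0.5 * (exp(-s) * L + s) over every head and every task."""
    total = 0.0
    for head_losses, head_log_vars in zip(losses, log_vars):
        for task, loss in head_losses.items():
            s = head_log_vars[task]
            total += 0.5 * (math.exp(-s) * loss + s)
    return total


# Hypothetical per-head losses for M = 2 heads (assumption, illustration only).
losses = [
    {"cls": 0.7, "reg": 1.2, "emb": 2.5},
    {"cls": 0.5, "reg": 0.9, "emb": 2.0},
]
# With s = 0 everywhere, each loss is simply weighted by 0.5.
log_vars = [{"cls": 0.0, "reg": 0.0, "emb": 0.0} for _ in losses]

total = uncertainty_weighted_loss(losses, log_vars)
print(total)
```

A larger \(s\) down-weights its loss term, while the additive \(s\) penalty keeps the optimizer from inflating every uncertainty to ignore the losses.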
a metric learning problem
Get Embedding
+ Question: It seems that FPN with the heads is not clearly described in the paper.
\[\mathbf{T}_{i} = \{e_{i},m_{i}\} \]
- \(e_{i}\) is the appearance state
- \[e_{i}^{t} = \alpha e^{t-1}_{i} + (1-\alpha)f_{i}^{t} \]
- \(f_{i}^{t}\) is the appearance embedding
- \(m_{i}^{t}\) is maintained by a Kalman filter
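The appearance update \(e_{i}^{t} = \alpha e^{t-1}_{i} + (1-\alpha)f_{i}^{t}\) is an exponential moving average and can be sketched as follows. The default `alpha=0.9` and the toy vectors are assumptions for illustration, and `update_appearance` is a hypothetical helper name.

```python
def update_appearance(prev_state, new_embedding, alpha=0.9):
    """Blend the previous appearance state with the newly matched embedding."""
    return [alpha * e + (1.0 - alpha) * f for e, f in zip(prev_state, new_embedding)]


# Toy 2-D example (assumption): the state drifts slowly toward the new embedding.
state = [1.0, 0.0]
state = update_appearance(state, [0.0, 1.0])
print(state)
```

A value of \(\alpha\) close to 1 makes the state change slowly, so a single noisy or partially occluded detection has little effect on the tracklet's appearance.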
using the Hungarian algorithm for linking
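The linking step is a linear assignment problem between tracklets and detections. The sketch below is not the Hungarian algorithm itself: to stay dependency-free, it finds the same optimal assignment by brute force, which is only practical for a handful of objects. It also assumes an appearance-only cosine-distance cost and a hypothetical gating threshold `max_cost`; a full tracker would typically also use the motion state.

```python
import math
from itertools import permutations


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def link(track_embs, det_embs, max_cost=0.5):
    """Return (track_idx, det_idx) pairs with minimal total cost.

    Brute-force stand-in for the Hungarian algorithm.
    Assumes len(track_embs) <= len(det_embs).
    """
    cost = [[cosine_distance(t, d) for d in det_embs] for t in track_embs]
    best = min(
        permutations(range(len(det_embs)), len(track_embs)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    # Drop pairs whose cost exceeds the (hypothetical) gating threshold.
    return [(i, j) for i, j in enumerate(best) if cost[i][j] <= max_cost]


# Toy embeddings (assumption): two tracklets, three detections.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.0]]

matches = link(tracks, detections)
print(matches)
```

In practice a polynomial-time solver would replace the permutation search; the cost matrix and the gating step stay the same.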
One may notice that JDE has a lower IDF1 score and more ID switches than existing methods. At first we suspect the reason is that the jointly learned embedding might be weaker than a separately learned embedding.
However, when we replace the jointly learned embedding with the separately learned embedding, the IDF1 score and the number of ID switches remain almost the same.