[paper reading][CVPR 2020] Temporal Pyramid Network for Action Recognition
Contents
- 1 Introduction
- 2 Related Work
- Video Action Recognition
- Visual Tempo Modeling in Action Recognition
- 3 Temporal Pyramid Network
- 3.2
- 3.3
- CVPR 2020
- https://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_Temporal_Pyramid_Network_for_Action_Recognition_CVPR_2020_paper.pdf
- visual tempo, temporal scales
- previous: sample raw videos at multiple rates, frame pyramid, multi-branch
- feature hierarchy
- improvements, plug-and-play, especially for actions with large variances in visual tempo
1 Introduction
- Visual tempo describes how fast an action goes; it largely determines the effective temporal scale (duration) needed for recognition
- inter-class difference in tempo, e.g. hand clapping vs. walking
- intra-class difference in tempo, e.g. a somersault performed at different speeds
- previous approaches: frame pyramid, multi-branch networks, multiple feature outputs combined
- the temporal receptive field grows with depth, so a single model can capture both fast and slow tempos
- feature-level aggregation
- ablation: most of the improvement comes from action classes with significant tempo variances
2 Related Work
Video Action Recognition
- 2D spatial then 1D temporal paradigm
- per-frame and optical flow (two stream)
- variants
- 2D CNNs: no temporal modeling at early stages
- 3D CNNs: non-local blocks, inflated 2D kernels, decomposed 3D convolutions...
Visual Tempo Modeling in Action Recognition
3 Temporal Pyramid Network
- TPN, single network, plug-and-play, feature level
- how to collect hierarchical features?
- "single depth": one stage sampled at multiple frame rates gives multiple tensors, but each has the same size \(C \times T \times W \times H\) (the same spatial granularity)
- "multi depth": features of multiple sizes with richer semantics; fusion needs careful treatment to ensure correct information flow
- modulation: align features (e.g. via strided convolutions) to match both shape and receptive field
- auxiliary classification heads for stronger supervision, \(\mathcal L_{total} = \mathcal L_{CE,o} + \sum_i \lambda_i \mathcal L_{CE,i}\)
- then temporal modulation: a factor \(\alpha\) flexibly controls the temporal downsampling rate of each level
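A minimal pure-Python sketch of these two pieces: combining the original loss with weighted auxiliary losses, and downsampling a frame sequence by a tempo factor \(\alpha\). The function names and the concrete numbers are illustrative assumptions, not the paper's settings.

```python
# Toy sketch (illustrative, not the paper's implementation).

def total_loss(loss_original, aux_losses, lambdas):
    """L_total = L_CE,o + sum_i lambda_i * L_CE,i"""
    assert len(aux_losses) == len(lambdas)
    return loss_original + sum(w * l for l, w in zip(aux_losses, lambdas))

def temporal_modulation(frames, alpha):
    """Downsample a frame sequence by factor alpha (strided sampling)."""
    return frames[::alpha]

loss = total_loss(1.0, aux_losses=[1.0, 0.5], lambdas=[0.5, 0.5])  # -> 1.75
fast = temporal_modulation(list(range(32)), alpha=1)  # all 32 frames kept
slow = temporal_modulation(list(range(32)), alpha=4)  # every 4th frame
```

In the real network the downsampling is done on feature tensors (e.g. with strided temporal pooling), but the effect on the time axis is the same strided selection.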
3.2
- how to aggregate? options: keep levels in isolation, bottom-up flow, top-down flow
- element-wise addition; a rate factor \(\delta\) re-samples features so shapes are compatible for the addition
- information flow:
- "Cascade": bottom-up applied after a top-down
- "Parallel": both flows run simultaneously
3.3
- ResNet-based 3D backbone
- hierarchical features from res2, res3, res4, res5, progressively downsampled
- stride convolutions, max-pooling, etc.; the whole network can be trained end-to-end
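For intuition on the hierarchy, a small sketch of the spatial resolutions those stages produce, assuming the standard ResNet stage strides (res2..res5 downsample the input by 4, 8, 16, 32); the helper name is mine, not from the paper.

```python
# Standard ResNet cumulative strides per stage (assumption: vanilla ResNet).
STAGE_STRIDES = {"res2": 4, "res3": 8, "res4": 16, "res5": 32}

def stage_sizes(input_size):
    """Spatial side length of each stage's output feature map."""
    return {name: input_size // s for name, s in STAGE_STRIDES.items()}

sizes = stage_sizes(224)
# res2 -> 56, res3 -> 28, res4 -> 14, res5 -> 7
```

These progressively smaller maps (with progressively larger receptive fields) are exactly the kind of feature hierarchy TPN aggregates.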