[paper reading][CVPR 2020] Temporal Pyramid Network for Action Recognition


Contents
  • 1 Introduction
  • 2 Related Work
    • Video Action Recognition
    • Visual Tempo Modeling in Action Recognition
  • 3 Temporal Pyramid Network
    • 3.2
    • 3.3

  • CVPR 2020
  • https://openaccess.thecvf.com/content_CVPR_2020/papers/Yang_Temporal_Pyramid_Network_for_Action_Recognition_CVPR_2020_paper.pdf
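  • topic: visual tempo, i.e. the temporal scale at which an action unfolds
  • previous: sample raw videos at multiple rates to build a frame pyramid, processed by a multi-branch network
  • TPN: works on the feature hierarchy instead
  • improvements as a plug-and-play module, especially on actions with large variance in tempo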

1 Introduction

  • Visual tempo actually describes how fast an action goes, which tends to determine the effective duration at the temporal scale for recognition
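    • inter-class difference: e.g. hand clapping vs. walking
    • intra-class difference: e.g. somersault
    • previous: frame pyramid fed to a multi-branch network, whose multiple feature outputs are then combined
    • temporal receptive field differs with depth, so features from different depths of a single model can capture both fast and slow tempos
    • hence feature-level aggregation
    • ablation: most of its improvements come from action classes with significant variance in visual tempo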
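
For contrast with the feature-level idea, the input-level frame pyramid used by previous methods can be sketched as follows. This is only an illustration: the function names, the 64-frame video and the strides are hypothetical, and real pipelines differ in how they pick clips.

```python
def sample_indices(num_frames, stride):
    """Frame indices obtained by keeping every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(0, num_frames, stride))


def frame_pyramid(num_frames, strides):
    """One list of frame indices per sampling rate (hypothetical helper)."""
    return {stride: sample_indices(num_frames, stride) for stride in strides}


# Hypothetical 64-frame video sampled at three rates.
# Each level would then be fed to its own backbone branch.
pyramid = frame_pyramid(num_frames=64, strides=[2, 8, 16])
```

Every level needs its own forward pass through a branch, which is the cost that aggregating features inside a single network avoids.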

2 Related Work

Video Action Recognition

  • 2D+1D paradigm: 2D CNNs on individual frames, then 1D temporal aggregation
  • per-frame RGB and optical flow (two-stream)
  • variants
  • drawback of 2D CNNs: no temporal modeling at early stages
  • 3D CNNs: non-local, inflating, decomposing...

Visual Tempo Modeling in Action Recognition

3 Temporal Pyramid Network

  • TPN, single network, plug-and-play, feature level
  • how to collect hierarchical features?
    • "single depth": sample one feature at multiple rates to get multiple tensors, but the size \(C \times T \times W \times H\) is otherwise the same (the same spatial granularity)
    • "multi depth": features of multiple sizes taken from different depths, richer semantics, but fusion needs careful treatment to ensure correct information flows
  • spatial semantic modulation: align the levels with level-specific strides so that shape and receptive field match
  • auxiliary classification head for stronger supervision, \(\mathcal L_{total} = \mathcal L_{CE,o} + \sum_i \lambda_i \mathcal L_{CE, i}\)
  • then, temporal rate modulation: hyper-parameter \(\alpha\) gives flexible control, each level is temporally downsampled by this factor
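
The total loss above can be sketched in plain Python. This is a minimal illustration, assuming one target label per sample; the logits, the single auxiliary head and the weight 0.5 are hypothetical values, not settings from the paper.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def total_loss(final_logits, aux_logits, target, lambdas):
    """L_total = L_CE,o + sum_i lambda_i * L_CE,i."""
    if len(aux_logits) != len(lambdas):
        raise ValueError("need exactly one weight per auxiliary head")
    loss = cross_entropy(final_logits, target)
    for lam, logits in zip(lambdas, aux_logits):
        loss += lam * cross_entropy(logits, target)
    return loss


# Hypothetical 3-class logits: the final head plus one auxiliary head.
final_logits = [2.0, 0.5, -1.0]
aux_logits = [[1.0, 1.0, 0.0]]
loss = total_loss(final_logits, aux_logits, target=0, lambdas=[0.5])
```

Setting every \(\lambda_i\) to zero recovers the plain cross-entropy on the final output.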

3.2

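  • how to aggregate? basic flows: isolation, bottom-up, top-down
  • element-wise addition; a temporal down/up-sampling \(\delta\) ensures the compatibility of the addition between consecutive levels
  • combined information flows:
    • "Cascade": a bottom-up flow applied after a top-down flow
    • "Parallel": both flows applied simultaneously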
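
A toy sketch of these flows, with each level reduced to a 1-D list of values along the temporal axis. Several things here are assumptions made for illustration: \(\delta\) is implemented as nearest-neighbour resampling, each level receives the already aggregated neighbour, and the parallel flow simply sums the two results.

```python
def resample(seq, length):
    """delta: nearest-neighbour down/up-sampling of `seq` to `length` steps."""
    n = len(seq)
    return [seq[i * n // length] for i in range(length)]


def add(a, b):
    """Element-wise addition of two equally long sequences."""
    return [x + y for x, y in zip(a, b)]


def top_down(levels):
    """Each level receives the aggregated feature of the level above it."""
    out = list(levels)
    for i in range(len(out) - 2, -1, -1):
        out[i] = add(out[i], resample(out[i + 1], len(out[i])))
    return out


def bottom_up(levels):
    """Each level receives the aggregated feature of the level below it."""
    out = list(levels)
    for i in range(1, len(out)):
        out[i] = add(out[i], resample(out[i - 1], len(out[i])))
    return out


def cascade(levels):
    """Bottom-up flow applied after a top-down flow."""
    return bottom_up(top_down(levels))


def parallel(levels):
    """Both flows run on the same input; summing them is an assumption."""
    return [add(a, b) for a, b in zip(top_down(levels), bottom_up(levels))]


# Two hypothetical levels with temporal lengths 4 and 2.
levels = [[1, 1, 1, 1], [10, 10]]
```

The isolation flow corresponds to returning `levels` unchanged.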

3.3

  • ResNet, 3D backbone
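  • features taken from res2, res3, res4, res5, each spatially downsampled relative to the input
  • level-specific stride convolutions, max-pooling..., TPN can be trained jointly with the backbone in an end-to-end manner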
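
The shape bookkeeping behind the two modulations can be sketched without any deep learning library. Everything numeric below is an assumption for illustration: an 8-frame 224x224 clip, the usual ResNet-50 stage widths and spatial strides (4, 8, 16, 32), no temporal downsampling in the backbone, 1024 output channels, and the per-level factors in `alphas`.

```python
def backbone_shape(channels, frames, size, spatial_stride):
    """(C, T, H, W) of a stage output, assuming no temporal downsampling."""
    return (channels, frames, size // spatial_stride, size // spatial_stride)


def spatial_modulation(shape, top_shape, out_channels):
    """Stand-in for a strided convolution: match the top level's spatial size."""
    _, t, h, w = shape
    stride = h // top_shape[2]  # level-specific stride
    return (out_channels, t, h // stride, w // stride)


def temporal_modulation(shape, alpha):
    """Downsample the temporal dimension by a factor of alpha."""
    c, t, h, w = shape
    return (c, t // alpha, h, w)


# Hypothetical settings: (channels, spatial stride) per stage, alpha per level.
stages = {"res2": (256, 4), "res3": (512, 8), "res4": (1024, 16), "res5": (2048, 32)}
alphas = {"res2": 8, "res3": 4, "res4": 2, "res5": 1}

raw = {name: backbone_shape(c, 8, 224, s) for name, (c, s) in stages.items()}
aligned = {name: spatial_modulation(shape, raw["res5"], 1024) for name, shape in raw.items()}
pyramid = {name: temporal_modulation(aligned[name], alphas[name]) for name in stages}
```

After both steps the levels agree in channels and spatial size and differ only in temporal length, which is what makes them a temporal pyramid.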