Unicorn - Towards Grand Unification of Object Tracking

What

  • This paper proposes a new unified tracking approach named Unicorn.
  • Unicorn solves four tracking tasks simultaneously: SOT, MOT, VOS, and MOTS.
  • Unicorn performs on par with or better than task-specific competitors on 8 tracking benchmarks:
    • LaSOT, TrackingNet, MOT17, BDD100K, DAVIS-16, DAVIS-17, MOTS20, BDD100K MOTS

Why?

Why do we want to do this?

  1. Single-task trackers may over-specialize to the characteristics of their specific sub-task, lacking generalization ability.
  2. Separate single-task designs cause parameter redundancy, even though much of what is learned could be shared to benefit the other tracking tasks.

Why was this not done before?

Each task has different characteristics.

  1. Number of tracks:
    1. MOT usually tracks tens or even hundreds of instances of specific categories.
    2. SOT needs to track one target given in the reference frame no matter what class it belongs to.
  2. Correspondence:
    1. MOT needs to match the currently detected objects with previous trajectories.
    2. SOT requires distinguishing the target from the background.
  3. Image area:
    1. SOT methods only take a small search region as the input to save computation and filter potential distractors.
    2. MOT algorithms usually take the high-resolution full image as the input for detecting instances as completely as possible.

How?

New Design Choices

Two designs are introduced: 1) the target prior and 2) the pixel-wise correspondence.

Target Prior

An additional input to the detection head serves as the switch among the four tasks. For SOT & VOS, the target prior is the propagated reference target map, which lets the head focus on the tracked target. For MOT & MOTS, the target prior is set to zero, and the head degenerates smoothly into a usual class-specific detection head.
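The switching mechanism can be sketched as follows. This is a minimal illustration, not the paper's code: the fusion operator here is a plain channel concatenation, and `head_input` is a name I made up.

```python
import numpy as np

def head_input(feat, target_prior=None):
    """Illustrative task switch for the detection head.

    feat:         (C, H, W) current-frame features entering the head
    target_prior: (1, H, W) propagated reference target map for SOT/VOS,
                  or None for MOT/MOTS (treated as an all-zero prior)
    """
    if target_prior is None:
        # MOT/MOTS: a zero prior, so the head behaves as a plain
        # class-specific detector
        target_prior = np.zeros((1,) + feat.shape[1:], dtype=feat.dtype)
    # One simple fusion choice: concatenate the prior as an extra channel
    # (the paper's actual fusion operator may differ)
    return np.concatenate([feat, target_prior], axis=0)

feat = np.ones((8, 4, 4), dtype=np.float32)
mot_input = head_input(feat)                                      # zero prior
sot_input = head_input(feat, np.full((1, 4, 4), 0.5, np.float32)) # target map
```

The key point is that the head's architecture is identical across tasks; only this one extra input changes.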

Pixel-wise correspondence

The pixel-wise correspondence is the similarity between all pairs of points from the reference and current frames.

Architecture

Three main components: 1) unified inputs and backbone 2) unified embedding 3) unified head.


Unified Inputs and Backbone

The reference and current frames are passed through a weight-sharing backbone to get feature pyramid representations, F_ref and F_cur.
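As a toy illustration of weight sharing (not the paper's actual network), the same weights process either frame, so both pyramids live in one feature space. Strided slicing stands in for real conv stages here.

```python
import numpy as np

def backbone(frame, W):
    """Toy weight-sharing backbone: returns a 3-level feature pyramid.

    frame: (C, H, W) input image features
    W:     the shared weights (a single scalar in this sketch)
    """
    pyramid, x = [], frame
    for _ in range(3):
        x = x[:, ::2, ::2] * W   # downsample by 2, apply the shared weight
        pyramid.append(x)
    return pyramid

# The *same* W is used for both frames
W = 0.5
F_ref = backbone(np.ones((8, 64, 64)), W)   # reference frame
F_cur = backbone(np.ones((8, 64, 64)), W)   # current frame
```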

Unified Embedding

A deformable attention layer is used to enhance the feature maps, which are then upsampled by 2x.

$$\{E_{ref}, E_{cur}\} = \mathrm{Upsample}(\mathrm{Attention}(F_{ref}, F_{cur}))$$

The pixel-wise and instance-level correspondences are then computed from these embeddings:

$$C_{pix} = \mathrm{softmax}(E_{cur} E_{ref}^T), \qquad C_{inst} = \mathrm{softmax}(e_{cur} e_{ref}^T)$$
❓
I initially didn’t understand how the instance embeddings are calculated. My reading: the instance embedding e is read out from the frame embedding E at the pixel where the instance’s center is located.
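Both correspondences can be sketched in a few lines, assuming the embeddings are flattened to one row per pixel. The function and argument names are mine, not the paper's API; the center-pixel readout for e reflects my interpretation above.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def correspondences(E_cur, E_ref, centers_cur, centers_ref):
    """Pixel-wise and instance-level correspondence maps.

    E_cur: (N_cur, d) current-frame embeddings, one row per pixel
    E_ref: (N_ref, d) reference-frame embeddings
    centers_*: indices of instance-center pixels in the flattened maps
    """
    # C_pix: for each current pixel, a distribution over reference pixels
    C_pix = softmax(E_cur @ E_ref.T, axis=-1)
    # Instance embeddings: the frame embedding at each instance's center pixel
    e_cur, e_ref = E_cur[centers_cur], E_ref[centers_ref]
    C_inst = softmax(e_cur @ e_ref.T, axis=-1)
    return C_pix, C_inst

rng = np.random.default_rng(0)
E_cur, E_ref = rng.normal(size=(6, 4)), rng.normal(size=(5, 4))
C_pix, C_inst = correspondences(E_cur, E_ref,
                                centers_cur=[0, 2], centers_ref=[1])
```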

Unified Head

See supplementary for more information (not public yet).

Training

  1. The network is optimized end-to-end with the correspondence loss and the detection loss, using data from SOT & MOT.
  2. A mask branch is then added and optimized with the mask loss, using data from VOS & MOTS, while all other parameters are kept fixed.
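The two-stage schedule can be summarized schematically. `step` below is a placeholder for one optimizer update, not the paper's API; the point is only which parameters and losses each stage touches.

```python
def two_stage_schedule(stage1_batches, stage2_batches, step):
    """Schematic of the two training stages.

    Stage 1: all parameters, correspondence + detection loss, SOT & MOT data.
    Stage 2: only the mask branch, mask loss, VOS & MOTS data.
    """
    for batch in stage1_batches:
        step(params="all", losses=("corr", "det"), batch=batch)
    for batch in stage2_batches:
        # everything except the mask branch stays frozen here
        step(params="mask_branch", losses=("mask",), batch=batch)

# Record the calls just to show the schedule order
log = []
two_stage_schedule(["sot/mot_batch"], ["vos/mots_batch"],
                   step=lambda **kw: log.append(kw))
```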

Inference

❓
I’m not exactly sure how inference is performed for MOT. The paper does not mention how the association is performed.

And?

Training on more data is generally good. This paper shows that it is possible to train a single model jointly on SOT, MOT, VOS, and MOTS datasets.

This note is a part of my paper notes series. You can find more here or on Twitter.