RAFT: Recurrent All-Pairs Field Transforms for Optical Flow


RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (ECCV 2020)

Authors: Zachary Teed and Jia Deng
Affiliation: Princeton University

The goal is to estimate optical flow from two images.
They claim state-of-the-art accuracy, strong generalization, and high efficiency (10 fps at 1088x436 on a GTX 1080 Ti).

Method

Their main contribution is a recurrent architecture that produces state-of-the-art optical flow.
Existing architectures typically estimate optical flow at a low resolution and then upsample it; RAFT instead maintains and updates a single flow field at a fixed resolution.

RAFT consists of the following:

  • Feature encoder
  • Context encoder
  • 4D Correlation layer
  • Update operator

Their architecture starts with two images and a single optical flow field, initialized to zero and maintained at 1/8 of the input resolution. They first build a feature encoding of the two images independently. From these encodings, they build a pyramid of 4D correlation volumes. The network then iteratively updates the flow field using lookups into the 4D correlation pyramid.

Note that the correlation volumes are only built once and are not updated.

Pipeline

Each image is passed through the feature encoder, yielding an \(H \times W \times D\) feature map for each image. Here each pixel is represented by a \(D\)-dimensional feature vector.

The 4D correlation volume is calculated from the dot product between all pairs of feature vectors.
This results in an \(H \times W \times H \times W\) tensor, i.e. entry \((i,j,k,l)\) is the dot product between the feature vector for pixel \((i,j)\) in image 1 and the feature vector for pixel \((k,l)\) in image 2.
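A minimal PyTorch sketch of this all-pairs dot product (the function name and tensor layout are illustrative, not taken from the authors' code):

  import torch

  def all_pairs_correlation(fmap1, fmap2):
      # fmap1, fmap2: (D, H, W) feature maps from the feature encoder
      D, H, W = fmap1.shape
      f1 = fmap1.reshape(D, H * W)
      f2 = fmap2.reshape(D, H * W)
      corr = f1.t() @ f2                # (H*W, H*W) matrix of dot products
      return corr.reshape(H, W, H, W)   # entry (i, j, k, l) as described above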

Then a correlation pyramid is calculated by pooling the last two dimensions of the volume to 1, 1/2, 1/4, and 1/8 resolution.
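A sketch of the pyramid construction, assuming average pooling applied only to the last two dimensions (as in the paper):

  import torch.nn.functional as F

  def correlation_pyramid(corr, levels=4):
      # corr: (H, W, H, W) volume from all_pairs_correlation above
      H, W = corr.shape[:2]
      pyramid = [corr]
      x = corr.reshape(H * W, 1, H, W)  # treat each source pixel as a batch entry
      for _ in range(levels - 1):
          x = F.avg_pool2d(x, kernel_size=2, stride=2)
          pyramid.append(x.reshape(H, W, *x.shape[-2:]))
      return pyramid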

Given the current optical flow map, this pyramid is sampled with bilinear interpolation to get a single correlation feature map. At each iteration, the update network receives these correlation features (looked up using the latest flow estimate), the context from the context encoder, and the hidden state from the previous iteration, and it produces a flow update that is added to the optical flow map.
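Putting the pieces together, a minimal sketch of the refinement loop (lookup is a hypothetical stand-in for the bilinear pyramid sampler, and update_block for the update operator described below):

  import torch

  def refine_flow(pyramid, net, inp, update_block, H, W, iters=12):
      flow = torch.zeros(2, H, W)       # flow initialized to zero
      for _ in range(iters):
          corr = lookup(pyramid, flow)  # hypothetical: bilinear sampling around the current flow
          net, delta_flow = update_block(net, inp, corr, flow)
          flow = flow + delta_flow      # additive update
      return flow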

Architecture

Details about their networks can be found in their code base and in their supplementary material at the end of their paper.

Feature Encoder

Their feature encoder is a CNN with 6 residual blocks: 2 at 1/2 resolution (1/2 width & 1/2 height), 2 at 1/4 resolution, and 2 at 1/8 resolution.
Each residual block consists of conv-bn-relu-conv-bn-relu followed by a residual (skip) addition and a final ReLU, using 3x3 convolutions, so each pair of residual blocks contains 4 convolutional layers. When the resolution decreases, the first convolution of the pair is strided.
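A sketch of one such residual block in PyTorch, following the description above (names and the details of the skip path are illustrative):

  import torch.nn as nn

  class ResidualBlock(nn.Module):
      def __init__(self, in_ch, out_ch, stride=1):
          super().__init__()
          self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
          self.bn1 = nn.BatchNorm2d(out_ch)
          self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
          self.bn2 = nn.BatchNorm2d(out_ch)
          self.relu = nn.ReLU(inplace=True)
          # 1x1 projection on the skip path when the shape changes
          self.skip = None
          if stride != 1 or in_ch != out_ch:
              self.skip = nn.Sequential(
                  nn.Conv2d(in_ch, out_ch, 1, stride=stride),
                  nn.BatchNorm2d(out_ch))

      def forward(self, x):
          y = self.relu(self.bn1(self.conv1(x)))  # conv-bn-relu
          y = self.relu(self.bn2(self.conv2(y)))  # conv-bn-relu
          if self.skip is not None:
              x = self.skip(x)
          return self.relu(x + y)                 # residual add, then final ReLU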

Their full encoder is:

  • 7x7 conv (stride 2, padding 3), BN, ReLU with 64 output channels (output is 1/2 resolution)
  • 2 Residual blocks at 1/2 width & height (stride = 1 so output is still 1/2 resolution) with 64 channels
  • 2 Residual blocks at 1/4 width & height (stride = 2 so output is 1/4 resolution) with 128 channels
  • 2 Residual blocks at 1/8 width & height (stride = 2 so output is 1/8 resolution) with 192 channels
  • 3x3 conv with 256 channels.

Note that in their code, the channels are actually hard-coded to 64, 64, 96, 128, 128.

They also have a smaller model with half the channels everywhere.
This is consistent between their code and the supplementary material: 32, 32, 64, 96, 96.
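For concreteness, a sketch assembling the encoder from the ResidualBlock above, using the channel widths from the supplementary-material list rather than the smaller hard-coded widths in the released code:

  import torch.nn as nn

  def make_feature_encoder(out_dim=256):
      return nn.Sequential(
          nn.Conv2d(3, 64, 7, stride=2, padding=3),                     # 1/2 resolution
          nn.BatchNorm2d(64), nn.ReLU(inplace=True),
          ResidualBlock(64, 64), ResidualBlock(64, 64),                 # 1/2
          ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128),    # 1/4
          ResidualBlock(128, 192, stride=2), ResidualBlock(192, 192),   # 1/8
          nn.Conv2d(192, out_dim, 3, padding=1))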

Update Block

This is the core part of the network.
The inputs are the flow, the correlation features, the context, and the latent hidden state.
Note that the inputs are not shown clearly in the main figure; the flow is also an input to the network.
In their code the inputs are:

  • net: the hidden state features
  • inp: the context features
  • corr: the correlation features
  • flow: the current flow estimate

The outputs are:

  • net: the new hidden features
  • mask: the mask for upsampling the optical flow
  • delta_flow: the flow update

Their update block consists of two ConvGRU units applied in sequence (with 1x5 and 5x1 kernels in their code). See their paper and code for details.
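A sketch of a single convolutional GRU cell of the kind used here (generic square kernels; as noted above, the released code chains two cells with 1x5 and 5x1 kernels instead):

  import torch
  import torch.nn as nn

  class ConvGRU(nn.Module):
      def __init__(self, hidden_dim, input_dim, k=3):
          super().__init__()
          p = k // 2
          self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)
          self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)
          self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, k, padding=p)

      def forward(self, h, x):
          # h: hidden state; x: concatenated flow, correlation, and context features
          hx = torch.cat([h, x], dim=1)
          z = torch.sigmoid(self.convz(hx))  # update gate
          r = torch.sigmoid(self.convr(hx))  # reset gate
          q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
          return (1 - z) * h + z * q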

Evaluation

They evaluate their network on Sintel and KITTI, with pretraining on FlyingChairs and FlyingThings3D.
They also test on the DAVIS dataset.
More details are in the paper; their network is also fine-tuned on several additional datasets.
