Optical Flow: An Overview
 Definition of Optical Flow
 Useful Resources
 Traditional Approach
 Dataset
 Evaluation Metric

End-to-end regression-based optical flow estimation
 Some useful concepts
 Overview of different models
 FlowNet (ICCV 2015) paper
 FlowNet 2.0 (CVPR 2017) paper
 SPyNet (CVPR 2017) paper code
 PWC-Net (CVPR 2018) paper code video
 IRR-PWC-Net (CVPR 2019) paper
 PWC-Net Fusion (WACV 2019) paper
 ScopeFlow (CVPR 2020) paper code
 MaskFlownet (CVPR 2020) paper code
 RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (arXiv 2020) paper code
Definition of Optical Flow
Distribution of apparent velocities of movement of brightness patterns in an image.
 Where do we need it?
 Action recognition
 Motion segmentation
 Video compression
Useful Resources
 CMU Computer Vision 16-385
 CMU Computer Vision 16-720
 Papers
 FlowNet: Learning Optical Flow with Convolutional Networks (ICCV 2015)
 FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks (CVPR 2017)
 Optical Flow Estimation using a Spatial Pyramid Network (CVPR 2017)
 Optical Flow Estimation in the Deep Learning Age (2020/04/06)
 PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume (CVPR 2018)
 Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation (CVPR 2019)
 A fusion approach for multi-frame optical flow estimation (WACV 2019)
 ScopeFlow: Dynamic Scene Scoping for Optical Flow (CVPR 2020)
 MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask (CVPR 2020)
 RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (arXiv 2020)
Traditional Approach
Brightness Constancy Assumption
\[\begin{align*} I(x(t), y(t), t) = C \end{align*}\]
Small Motion Assumption
\[\begin{align*} &I(x(t+\delta t), y(t+\delta t), t+ \delta t) = I(x(t), y(t), t) + \frac{dI}{dt}\delta t \quad \text{(higher-order terms ignored)}\\ &\text{where }\frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{d x}{d t} + \frac{\partial I}{\partial y}\frac{d y}{d t} + \frac{\partial I}{\partial t}\triangleq I_x u + I_y v + I_t \triangleq (\nabla I)^T [u, v]^T + I_t \end{align*}\]
\(\nabla I = [I_x, I_y]^T\) : spatial derivative
\(I_t\) : temporal derivative
\([u, v]\) : optical flow velocities
Brightness Constancy Equation
Combining the above two assumptions, we obtain
\[\begin{align*} (\nabla I)^T [u, v]^T + I_t = 0. \end{align*}\]
How to solve the Brightness Constancy Equation?
The temporal derivative \(I_t\) can be estimated by frame differencing; the spatial derivative \(\nabla I\) can be estimated using spatial filters. This gives one equation per pixel but two unknowns (\(u\) and \(v\)), so the system is underdetermined.
Two ways to enforce additional constraints:
 Lucas-Kanade Optical Flow (1981) : assuming a local patch has constant flow
 Least squares (LS) can be applied to solve the resulting overdetermined set of equations (see the sketch after this list)
 If a local patch lacks spatial gradient, the set of equations could still be underdetermined. This is referred to as the aperture problem
 If flow is estimated only for tractable patches, the result is called sparse flow
 Horn-Schunck Optical Flow (1981) : assuming a smooth flow field
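A minimal NumPy sketch of the Lucas-Kanade least-squares solve for one patch, assuming the derivative arrays `Ix`, `Iy`, `It` for that patch have already been estimated (e.g., with spatial filters and frame differencing as above):

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Estimate a single flow vector (u, v) for one patch via least squares.

    Ix, Iy, It: spatial/temporal derivatives over the patch (same shape).
    Returns None when the patch lacks spatial gradient (aperture problem).
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # N x 2
    b = -It.ravel()                                  # N
    ATA = A.T @ A
    if np.linalg.cond(ATA) > 1e6:                    # (near-)singular normal equations
        return None
    u, v = np.linalg.solve(ATA, A.T @ b)             # solve A^T A [u, v]^T = A^T b
    return u, v
```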
Formulation of Horn-Schunck Optical Flow
Brightness constancy constraint/loss :
\[\begin{align*} E_d(i, j) = \left[I_x(i,j) u(i,j)+I_y(i,j)v(i,j) + I_t(i,j)\right]^2 \end{align*}\]
Smoothness constraint/loss :
\[\begin{align*} E_s(i, j) = \frac{1}{4}\left[(u(i,j)-u(i+1,j))^2 + (u(i,j)-u(i,j+1))^2 + (v(i,j)-v(i+1,j))^2 + (v(i,j)-v(i,j+1))^2\right] \end{align*}\]
Solving for optical flow :
\[\begin{align*} \min_{\bf{u}, \bf{v}} \sum_{i,j} E_d(i,j) + \lambda E_s(i,j) \end{align*}\]
Gradient descent can be used to solve the above optimization problem.
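As a concrete gradient step (a sketch for interior pixels, with boundary handling omitted): writing \(E\) for the full objective and \(\bar{u}(i,j)\) for the average of the four neighbors of \(u(i,j)\), differentiating gives
\[\begin{align*} \frac{\partial E}{\partial u(i,j)} = 2 I_x(i,j)\left[I_x(i,j)u(i,j)+I_y(i,j)v(i,j)+I_t(i,j)\right] + 2\lambda\left[u(i,j) - \bar{u}(i,j)\right], \end{align*}\]
and analogously for \(v(i,j)\); each gradient-descent iteration moves \(u\) and \(v\) a small step against these derivatives.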
Discrete Optical Flow Estimation
The Brightness Constancy Equation assumes small motion, which in general does not hold. If the movement is beyond 1 pixel, then higher-order terms in the Taylor expansion of \(I(x(t), y(t), t)\) could dominate. There are two solutions:
 Reduce the resolution using a coarse-to-fine architecture
 Resort to discrete optical flow estimation
For case 2, we obtain the optical flow estimate by minimizing the following objective
\[\begin{align*} &E({\bf{z}}) = \sum_{i\in\mathcal{I}}D(z_i) + \sum_{(i,j)\in \mathcal{N}}S(z_i, z_j)\\ &\text{where } z_i\triangleq (u_i, v_i), \mathcal{I}\triangleq\text{set of all pixels}, \mathcal{N}\triangleq\text{set of all neighboring pixel pairs} \end{align*}\]
The above can be viewed as energy minimization in a Markov random field.
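A brute-force sketch (illustration only) of evaluating \(E(\bf{z})\) for a candidate integer flow labeling, assuming a hypothetical `data_cost` helper for \(D(z_i)\) and taking \(S\) to be a truncated L1 penalty between neighboring flows (one common choice):

```python
import numpy as np

def discrete_flow_energy(z, data_cost, smooth_weight=1.0, trunc=10.0):
    """Evaluate E(z) = sum_i D(z_i) + sum_{(i,j) in N} S(z_i, z_j).

    z         : (H, W, 2) integer flow labels (u, v) per pixel.
    data_cost : callable (row, col, u, v) -> matching cost D(z_i)  (hypothetical helper).
    S is chosen here as a truncated L1 penalty between 4-neighbors.
    """
    H, W, _ = z.shape
    energy = 0.0
    for r in range(H):
        for c in range(W):
            energy += data_cost(r, c, z[r, c, 0], z[r, c, 1])
            for dr, dc in ((1, 0), (0, 1)):           # right/down neighbors only
                nr, nc = r + dr, c + dc
                if nr < H and nc < W:
                    diff = float(np.abs(z[r, c] - z[nr, nc]).sum())
                    energy += smooth_weight * min(diff, trunc)
    return energy
```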
Dataset
Table 1 from FlowNet
| Entry | Frame pairs | Frames with ground truth | Ground-truth density per frame |
|---|---|---|---|
| Middlebury | 72 | 72 | 100% |
| KITTI 2012 | 194 | 194 | 50% |
| MPI Sintel | 1041 | 1041 | 100% |
| Flying Chairs | 22872 | 22972 | 100% |
| Flying Things 3D | 22872 | | 100% |
Middlebury (link, paper)
Contains only 8 image pairs for training, with ground truth flows generated using four different techniques. Displacements are very small, typically below 10 pixels. (Section 4.1 in FlowNet)
MPI Sintel (link, paper)
Computer-animated action movie. There are three render passes with varying degrees of realism:
 Albedo render pass
 Clean pass (adds natural shading, cast shadows, specular reflections, and more complex lighting effects)
 Final pass (adds motion blur, focus blur, and atmospheric effects)
Contains 1064 training / 564 withheld test flow fields
KITTI (link, paper)
Contains 194 training image pairs and includes large displacements, but contains only a very special motion type. The ground truth is obtained from real world scenes by simultaneously recording the scenes with a camera and a 3D laser scanner. This assumes that the scene is rigid and that the motion stems from a moving observer. Moreover, motion of distant objects, such as the sky, cannot be captured, resulting in sparse optical flow ground truth.
Flying Chairs (link, paper)
Contains about 22k image pairs of chairs superimposed on random background images from Flickr. Random affine transformations are applied to chairs and background to obtain the second image and ground truth flow fields. The dataset contains only planar motions. (Section 3 in FlowNet 2.0)
Flying Things 3D (link, paper)
A natural extension of the FlyingChairs dataset, having 22,872 larger 3D scenes with more complex motion patterns.
Evaluation Metric
Angular Error (AE)
AE between \((u_0, v_0)\) and \((u_1, v_1)\) is the angle in 3D space between \((u_0, v_0, 1.0)\) and \((u_1, v_1, 1.0)\). Errors in large flows are penalized less than errors in small flows. (Section 4.1 in link)
End Point Error (EPE)
EPE between \((u_0, v_0)\) and \((u_1, v_1)\) is \(\sqrt{(u_0-u_1)^2 + (v_0-v_1)^2}\) (Euclidean distance).
For MPI Sintel, papers also often report a detailed breakdown of EPE for pixels at different distances to motion boundaries (\(d_{0-10}\), \(d_{10-60}\), \(d_{60-140}\)) and with different velocities (\(s_{0-10}\), \(s_{10-40}\), \(s_{40+}\)).
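A minimal NumPy sketch of both metrics, assuming the estimated and ground-truth flows are given as `(H, W, 2)` arrays:

```python
import numpy as np

def end_point_error(flow_est, flow_gt):
    """Mean EPE: Euclidean distance between estimated and ground-truth flow vectors."""
    return np.linalg.norm(flow_est - flow_gt, axis=-1).mean()

def angular_error(flow_est, flow_gt):
    """Mean AE (degrees): angle between (u, v, 1.0) vectors of estimate and ground truth."""
    ones = np.ones(flow_est.shape[:-1] + (1,))
    a = np.concatenate([flow_est, ones], axis=-1)
    b = np.concatenate([flow_gt, ones], axis=-1)
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
```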
End-to-end regression-based optical flow estimation
Some useful concepts
 Backward warping
\(I_1(\cdot, \cdot)\triangleq I(\cdot, \cdot, t=1), I_2(\cdot, \cdot)\triangleq I(\cdot, \cdot, t=2)\). The optical flow field \((u, v)\) satisfies \(I_1(x, y) = I_2(x+u, y+v)\). In other words, \((u, v)\) tells us where each pixel in \(I_1\) comes from in \(I_2\); given \(u, v\), we know how to move around (warp) the pixels in \(I_2\) to obtain \(I_1\). Here \(I_1\) is often referred to as the source image and \(I_2\) the target image – the flow field is defined on the source image. Specifically, we can define a warp operation as below.
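One way to write this warp (a sketch consistent with the notation above): writing \(f = (u, v)\) for the flow field,
\[\begin{align*} \texttt{warp}(I_2, f)(x, y) \triangleq I_2\big(x + u(x, y),\; y + v(x, y)\big), \end{align*}\]
with bilinear interpolation at non-integer locations, so that \(\texttt{warp}(I_2, f) \approx I_1\) when the flow is correct. A PyTorch sketch of the same operation using `grid_sample` (tensor layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def backward_warp(target, flow):
    """Backward-warp `target` (N, C, H, W) by `flow` (N, 2, H, W) in pixels,
    so that out[:, :, y, x] ~= target[:, :, y + v(x, y), x + u(x, y)]."""
    n, _, h, w = target.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(target.device)   # (2, H, W), (x, y)
    coords = base.unsqueeze(0) + flow                                # sampling locations
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                             # (N, H, W, 2)
    return F.grid_sample(target, grid, mode="bilinear", align_corners=True)
```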
 Compositivity of backward warping
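One useful identity here (a sketch under the warp definition above, ignoring interpolation error): two backward warps can be fused into a single one,
\[\begin{align*} \texttt{warp}\big(\texttt{warp}(I, f_b), f_a\big) = \texttt{warp}\big(I,\; f_a + \texttt{warp}(f_b, f_a)\big), \end{align*}\]
i.e., warping by \(f_b\) and then by \(f_a\) equals a single backward warp with the composed flow \(f_a + \texttt{warp}(f_b, f_a)\).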
Overview of different models
| Model Name | Num of parameters | Inference speed | Training time | MPI Sintel final test EPE |
|---|---|---|---|---|
| FlowNetS | 32M | 87.72 fps | 4 days | 7.218 |
| FlowNetC | 32M | 46.10 fps | 6 days | 7.883 |
| FlowNet2.0 | 162M | 11.79 fps | 14 days | 6.016 |
| SPyNet | 1.2M | | | 8.360 |
| PWC-Net | 8.7M | 35.01 fps | 4.8 days | 5.042 |
The EPE column is taken from Table 2 of an overview paper. The inference speed (on a Pascal Titan X) and training time columns are taken from Table 7 of the PWC-Net paper.
FlowNet (ICCV 2015) paper
The first endtoend CNN architecture for estimating optical flow. Two variants:
 FlowNetS
 A pair of input images is simply concatenated and fed into a U-shaped network that directly outputs optical flow.
 FlowNetC
 FlowNetC has a shared encoder for both images, which extracts a feature map for each input image, and a cost volume is constructed by measuring patch-level similarity between the two feature maps with a correlation operation. The result is fed into the subsequent network layers.
A multi-scale training loss is applied. Both models still underperform energy-based approaches.
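A naive (unoptimized) sketch of such a correlation layer, assuming feature maps of shape `(N, C, H, W)` and a hypothetical maximum displacement `max_disp`:

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat1, feat2, max_disp=4):
    """Correlate feat1 with shifted copies of feat2 over a (2*max_disp+1)^2 window.

    Returns a cost volume of shape (N, (2*max_disp+1)**2, H, W),
    one channel per candidate displacement.
    """
    n, c, h, w = feat1.shape
    padded = F.pad(feat2, [max_disp] * 4)                   # pad left/right/top/bottom
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]    # feat2 shifted by (dx, dy)
            costs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)
```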
FlowNet 2.0 (CVPR 2017) paper
Key ideas:
 By stacking multiple FlowNet-style networks, one can sequentially refine the output from previous network modules.
 It is helpful to pretrain networks on a less challenging synthetic dataset first and then further train on a more challenging synthetic dataset with 3D motion and photometric effects.
End-to-end approaches start to outperform energy-based ones.
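A highly simplified sketch of the stacking idea, reusing the `backward_warp` sketch from earlier; the refinement network additionally receives the warped second image and its brightness error (network definitions and names are assumptions):

```python
import torch

def stacked_flow(img1, img2, net1, net2, backward_warp):
    """Two-stage stack: net2 refines net1's flow given warping feedback."""
    flow1 = net1(torch.cat([img1, img2], dim=1))
    warped2 = backward_warp(img2, flow1)
    error = torch.norm(img1 - warped2, dim=1, keepdim=True)   # brightness error
    flow2 = net2(torch.cat([img1, img2, warped2, flow1, error], dim=1))
    return flow2
```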
SPyNet (CVPR 2017) paper code
Key idea:
 Incorporate the classic coarse-to-fine concept into a CNN and update residual flow over multiple pyramid levels (5 image pyramid levels are used). Networks at different levels have separate parameters.
Achieves performance comparable to FlowNet with 96% fewer parameters.
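A sketch of the coarse-to-fine residual update in the spirit of SPyNet; `level_nets` (one CNN per level, coarsest first) and `backward_warp` are assumptions reusing the earlier warp sketch:

```python
import torch
import torch.nn.functional as F

def pyramid_flow(img1_pyr, img2_pyr, level_nets, backward_warp):
    """Coarse-to-fine residual flow: each level refines the upsampled coarser flow."""
    n, _, h0, w0 = img1_pyr[0].shape
    flow = img1_pyr[0].new_zeros(n, 2, h0, w0)               # zero flow at coarsest level
    for img1, img2, net in zip(img1_pyr, img2_pyr, level_nets):
        # Upsample the coarser flow to this level and scale its magnitude.
        flow = 2.0 * F.interpolate(flow, size=img1.shape[-2:],
                                   mode="bilinear", align_corners=False)
        warped2 = backward_warp(img2, flow)
        flow = flow + net(torch.cat([img1, warped2, flow], dim=1))  # residual update
    return flow
```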
PWC-Net (CVPR 2018) paper code video
Key ideas:
 Learned feature pyramid instead of image pyramid
 Warping of feature maps
 Computing a cost volume of learned feature maps (correlation)
Computation steps:
 Feature pyramid extractor: convnet with downsampling
 The target feature map is warped by the upsampled flow estimate from the previous (coarser) level
 A cost volume is computed from the source feature map and the warped target feature map
 Optical flow estimator: a DenseNet-style network that takes (1) the source feature map, (2) the cost volume, and (3) the upsampled previous optical flow estimate
 Context network: a dilated convolution network to post-process the estimated optical flow
Remarks:
 Multi-scale training loss
 The network at each scale estimates the full optical flow for that scale, not a residual flow (the addition happens implicitly inside the optical flow estimator); a simplified per-level sketch is given below.
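A simplified sketch of one decoder level (names and the level-dependent flow scaling are assumptions; `warp` and `corr` stand for the operations sketched earlier):

```python
import torch

def pwc_level(feat1, feat2, up_flow, warp, corr, estimator, flow_scale):
    """One PWC-Net-style pyramid level: warp, correlate, then estimate full flow."""
    warped2 = warp(feat2, up_flow * flow_scale)      # warp target features with coarser flow
    cost = corr(feat1, warped2)                      # cost volume from source + warped target
    x = torch.cat([cost, feat1, up_flow], dim=1)
    return estimator(x)                              # flow at this scale (not a residual)
```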
IRR-PWC-Net (CVPR 2019) paper
Key ideas:
 Take the output from a previous pass through the network as input and iteratively refine it using only a single network block with shared weights, which lets the network residually refine the previous estimate.
 For PWC-Net, sharing the decoder module across pyramid levels is achieved by applying a 1x1 convolution to the source feature map (mapping it to a fixed number of channels) before feeding it to the optical flow estimator/decoder (see the sketch after this list).
 Joint occlusion and bidirectional optical flow estimation leads to further performance enhancement.
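A sketch of the weight-sharing idea (class and argument names are assumptions): per-level 1x1 convolutions adapt level-specific feature channels so a single shared decoder can be reused at every pyramid level:

```python
import torch.nn as nn

class SharedDecoderWithAdapters(nn.Module):
    """One flow decoder shared across pyramid levels, plus per-level 1x1 adapters."""

    def __init__(self, level_channels, common_channels, decoder):
        super().__init__()
        self.adapters = nn.ModuleList(
            [nn.Conv2d(c, common_channels, kernel_size=1) for c in level_channels])
        self.decoder = decoder                      # single module, shared weights

    def forward(self, feat, level, other_inputs):
        return self.decoder(self.adapters[level](feat), other_inputs)
```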
PWC-Net Fusion (WACV 2019) paper
The paper focuses on the three-frame optical flow estimation problem: given \(I_{t-1}\), \(I_{t}\), and \(I_{t+1}\), estimate \(f_{t\to t+1}\).
Key ideas:
 If we are given \(f_{t-1\to t}\) and \(f_{t\to t-1}\), and assume constant velocity of movement, then an estimate of \(f_{t\to t+1}\) can be formed by backward warping \(f_{t-1\to t}\) with \(f_{t\to t-1}\): \(\begin{align*} &\widehat{f}_{t\to t+1} \triangleq \texttt{warp}(f_{t-1 \to t}, f_{t\to t-1}), \\ &\widehat{f}_{t\to t+1}(x, y) \triangleq f_{t-1 \to t}\left(x+u_{t\to t-1}(x,y),\; y+v_{t\to t-1}(x,y)\right), \end{align*}\) where \((u_{t\to t-1}, v_{t\to t-1})\) are the two components of \(f_{t\to t-1}\).
 With three frames available, we can plug in any two-frame optical flow estimation solution (PWC-Net in this case) to obtain \(f_{t-1 \to t}\), \(f_{t\to t+1}\), and \(f_{t \to t-1}\).
 A fusion network (similar to the one used in the last stage of FlowNet 2.0) can be used to fuse together \(\widehat{f}_{t \to t+1}\triangleq\texttt{warp}(f_{t-1 \to t}, f_{t\to t-1})\) and \(f_{t \to t+1}\).
 Note that \(\widehat{f}_{t\to t+1}\) would be identical to \(f_{t\to t+1}\) if (a) velocity is constant, (b) the three optical flow estimates are correct, and (c) there are no occlusions. Brightness constancy errors of the two flow maps, together with the source frame \(I_t\), are fed into the fusion network to provide additional information.
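Since a flow field is just a two-channel image, \(\widehat{f}_{t\to t+1}\) can be computed with the same `backward_warp` sketch from earlier (tensor names are assumptions):

```python
# Estimate of f_{t->t+1}: propagate f_{t-1->t} forward under constant velocity.
flow_hat = backward_warp(flow_prev_to_cur, flow_cur_to_prev)
```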
Why multi-frame estimation may perform better than two-frame solutions:
 temporal smoothness leads to additional regularization.
 longer time sequences may help in ambiguous situations such as occluded regions.
ScopeFlow (CVPR 2020) paper code
ScopeFlow revisits the following two parts of the conventional end-to-end training pipeline/protocol
 Data augmentation:
 photometric transformations: input image perturbation, such as color and gamma corrections.
 geometric augmentations: global or relative affine transformation, followed by random horizontal and vertical flipping.
 cropping
 Regularization
 weight decay
 adding random Gaussian noise
and advocates
 use larger scopes (crops and zoom-out) when possible.
 gradually reduce regularization
MaskFlownet (CVPR 2020) paper code
Key idea:
 Incorporates a learnable occlusion mask that filters occluded areas immediately after feature warping without any explicit supervision.
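A heavily simplified sketch of the core idea (the actual module in the paper is more elaborate); `mask_net` is a hypothetical small CNN whose output is trained end-to-end without occlusion labels, and `backward_warp` reuses the earlier warp sketch:

```python
import torch

def masked_feature_warp(feat2, flow, backward_warp, mask_net):
    """Warp target features, then softly filter occluded regions with a learned mask."""
    warped = backward_warp(feat2, flow)
    occ_mask = torch.sigmoid(mask_net(warped))       # per-pixel mask in [0, 1], no supervision
    return warped * occ_mask
```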