H.264 Inter Coding

In this post, I will examine the inter-coding aspects of H.264. Inter coding is designed to exploit temporal correlation between frames. For example, when encoding a background that stays the same over several frames, a macroblock that is part of the background needs to be encoded only once, rather than once per frame. You can read my earlier posts in this video coding series for background.

All the video codecs from MPEG-1 through MPEG-4 (including H.264) use “block-based motion compensation”. When an object moves in the video, the blocks associated with the object remain largely unchanged but move to new locations in later frames. This movement can be described by a vector (called a motion vector), and instead of encoding the block itself, it is typically more efficient to encode the identity of the reference frame (also called the anchor frame) in which the block occurs together with the motion vector. The process by which the decoder uses this information (reference frame + motion vector) to render the block is called motion compensation.
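
To make the idea concrete, here is a minimal sketch (in Python with NumPy; the function name and frame layout are mine, not from any codec) of how a decoder copies the predicted block out of a reference frame using an integer motion vector:

```python
# A minimal sketch of block-based motion compensation.
# Real codecs also handle fractional motion vectors and frame-boundary padding.
import numpy as np

def motion_compensate(reference, top, left, mv, block_h=16, block_w=16):
    """Return the block that the motion vector points to in the reference frame.

    (top, left): position of the current block in the current frame.
    mv: (dy, dx) integer offset into the reference frame.
    """
    dy, dx = mv
    return reference[top + dy : top + dy + block_h,
                     left + dx : left + dx + block_w]

# The decoder reconstructs the block as prediction + decoded residual, e.g.:
# reconstructed = motion_compensate(ref_frame, 32, 48, (-3, 5)) + residual
```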
In MPEG-1 through MPEG-4, P-frames could use one motion vector per macroblock (B-frames could use two motion vectors: one for the past reference frame and one for the future). In H.264, multiple motion vectors referring to several reference frames can be used, and motion vectors can be assigned to blocks all the way down to 4×4.
A 16×16 luma macroblock in H.264 can be inter-predicted in several different ways: as one 16×16 block, two 16×8 blocks, two 8×16 blocks, or four 8×8 blocks, with each block getting its own motion vector. These are called the partitions of a macroblock. Each 8×8 block can, in turn, be inter-predicted as one 8×8 block, two 8×4 blocks, two 4×8 blocks, or four 4×4 blocks. These are called sub-partitions (see picture below). This method of using partitions and sub-partitions for inter prediction is called tree-structured motion compensation.
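
The following sketch simply enumerates the legal shapes of tree-structured motion compensation; the names and the way a split is represented are made up for illustration:

```python
# Partition and sub-partition sizes allowed by tree-structured motion
# compensation. Which split to use per macroblock is an encoder decision.
MACROBLOCK_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]   # width x height
SUB_PARTITIONS        = [(8, 8), (8, 4), (4, 8), (4, 4)]       # within an 8x8 block

def partition_shapes(mb_split, sub_splits=None):
    """Return the list of block shapes for one macroblock.

    mb_split: one of MACROBLOCK_PARTITIONS.
    sub_splits: when mb_split is (8, 8), a list of four entries from
                SUB_PARTITIONS, one per 8x8 block.
    """
    if mb_split != (8, 8):
        count = (16 // mb_split[0]) * (16 // mb_split[1])
        return [mb_split] * count
    shapes = []
    for split in sub_splits:
        count = (8 // split[0]) * (8 // split[1])
        shapes.extend([split] * count)
    return shapes

# e.g. partition_shapes((8, 8), [(8, 8), (8, 4), (4, 8), (4, 4)])
# -> one 8x8, two 8x4, two 4x8 and four 4x4 blocks (nine motion vectors in total)
```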

Note that if the inter prediction for a particular macroblock works perfectly (i.e., the quantized residual is all zeros and the motion can be inferred from the predicted motion vector discussed below), the macroblock is marked as a “skip”. Otherwise, larger partition sizes lead to fewer bits for encoding the motion vectors and reference frame indices, but can lead to more bits spent on encoding the residual (actual partition – motion-predicted partition) for partitions with high detail. Choosing smaller partition sizes, on the other hand, means more bits spent on motion vectors and reference frame indices, but potentially fewer bits on the residual. There are several algorithms for finding the optimal split (into partitions and sub-partitions), but those are beyond the scope of this video coding series.
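
As a toy illustration of this bit-budget trade-off (the numbers and the cost model are made up; real encoders use rate-distortion optimization, which is beyond the scope of this post):

```python
def estimated_cost(num_blocks, bits_per_mv, residual_bits):
    """Total bits = side information for every block + bits for the residual."""
    return num_blocks * bits_per_mv + residual_bits

# Flat background: one 16x16 partition predicts well, so the residual is cheap.
flat = estimated_cost(num_blocks=1, bits_per_mv=12, residual_bits=40)

# Detailed area: sixteen 4x4 sub-partitions cost more side information,
# but each small block is predicted better, shrinking the residual.
detailed = estimated_cost(num_blocks=16, bits_per_mv=12, residual_bits=150)

# The encoder picks whichever candidate split yields fewer total bits.
```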

Also, the chrominance components of a macroblock are split into partitions in a corresponding way: each chroma partition is the luma partition scaled down by the chroma subsampling factors. For example, for a YCbCr 4:2:2 input, the Cb and Cr components of a macroblock are 8×16 blocks (half the luma width, full luma height), so a 16×16 luma partition corresponds to an 8×16 chroma partition, a 16×8 luma partition to an 8×8 chroma partition, and so on down through the sub-partitions.
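
Here is a small sketch (with a hypothetical helper) of how a luma block size maps to the corresponding chroma block size under the usual subsampling schemes:

```python
# Width/height divisors for the standard chroma subsampling schemes.
CHROMA_DIVISORS = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}  # (w_div, h_div)

def chroma_block_size(luma_w, luma_h, sampling="4:2:2"):
    """Scale a luma partition size down to the matching chroma partition size."""
    w_div, h_div = CHROMA_DIVISORS[sampling]
    return luma_w // w_div, luma_h // h_div

# chroma_block_size(16, 16, "4:2:2") -> (8, 16): the Cb/Cr macroblock for 4:2:2
# chroma_block_size(16, 16, "4:2:0") -> (8, 8)
```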

For a given partition (or sub-partition), an equally sized block in a given reference frame is used for prediction. The offset of that block in the reference frame with respect to the given partition is specified by a two-dimensional motion vector (see the picture below). In earlier standards such as MPEG-1 and MPEG-2, the resolution of the motion vectors was half a pixel (for example, (0.5, 1.5) would be a valid motion vector). For integer-valued motion vectors, the block in the reference frame exists as real pixels. For fractional values like (0.5, 1.5), the block in the reference frame has to be constructed from the neighboring (integer-positioned) real pixels (as shown in picture ‘c’ below). In H.264, a 6-tap filter (a weighted average of the six pixels on the same horizontal or vertical line) is used to compute the half-pixel values (note that for some positions, multiple possibilities arise; see here for more details).
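
Here is a minimal sketch of that 6-tap filter (coefficients 1, −5, 20, 20, −5, 1, normalized by 32); the function name and the example values are mine:

```python
def half_pel(p0, p1, p2, p3, p4, p5):
    """Interpolate the half-pel sample between p2 and p3 from six
    integer-pel neighbors on the same row (or column)."""
    value = (p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5 + 16) >> 5
    return max(0, min(255, value))          # clip to the 8-bit sample range

# e.g. half_pel(100, 110, 120, 130, 140, 150) -> 125,
# roughly halfway between the two center samples (120 and 130).
```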

In MPEG-4 ASP (Advanced Simple Profile) and H.264, the resolution of motion vectors is a quarter pel (a quarter of a pixel). Once the half-pixel (hpel) values are computed, the quarter-pixel (qpel) values are computed using bilinear interpolation of the surrounding integer- and half-pel samples (an equation of the form v = a + b·x + c·y + d·x·y, where x and y are the fractional offsets, a, b, c, d are coefficients derived from the surrounding samples, and v is the computed qpel value); in H.264 this reduces to a rounded average of the two nearest samples.
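
A minimal sketch of that quarter-pel step, assuming the half-pel values have already been computed with the 6-tap filter above (the function name is mine):

```python
def quarter_pel(a, b):
    """Quarter-pel sample lying between samples a and b, where a and b are
    the two nearest integer- or half-pel samples (rounded average)."""
    return (a + b + 1) >> 1

# e.g. quarter_pel(120, 125) -> 123
```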

Note that motion vectors themselves are not transmitted in full for all the partitions. Instead, each motion vector is predicted from the motion vectors of neighboring (sub-)partitions, and only the difference from that prediction is encoded. In particular, the component-wise median of the motion vectors of the partitions immediately to the left, immediately above, and diagonally above and to the right is used as the prediction for the motion vector of the partition under consideration (note that this predictor is different when “uneven” partitions like 16×8 blocks are used, or when some of the motion vectors are unavailable). You can check the ffmpeg implementation here for more details.
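
Here is a sketch of the basic median predictor (ignoring the special cases for unavailable neighbors and uneven partitions; the function names are mine):

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Component-wise median of the three neighboring motion vectors."""
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(mv_left[0], mv_above[0], mv_above_right[0]),
            median3(mv_left[1], mv_above[1], mv_above_right[1]))

# Only the difference mvd = mv - median_mv_predictor(...) is entropy-coded.
# e.g. median_mv_predictor((2, -1), (3, 0), (10, 4)) -> (3, 0)
```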
