URI vs URL vs URN

URI — Uniform Resource Identifier
URL — Uniform Resource Locator
URN — Uniform Resource Name
URC — Uniform Resource Citation

The traditional view:
A URI is either a URL or a URN (it could have been a URC, which refers to the metadata associated with a resource instead of the resource itself; but the URC never really took off).

A URL identifies a resource in terms of its location. Thus, http://www.google.com identifies a resource reachable over the HTTP protocol at the host www.google.com.

A URN identifies a resource by name, but does not convey location or other metadata associated with the resource. Thus, isbn:1-23-456789-0 identifies a book, but gives no location or other metadata for it.

The current view:
The set of URIs is partitioned into subsets called subspaces.
a) The http example above represents the http subspace within the URI space; http is called a URI scheme.
b) The isbn example above is now represented as urn:isbn:1-23-456789-0, and urn is a subspace within the URI space. isbn is a namespace within the URN subspace.
c) URL refers to the subspaces which identify a resource via its location.
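
As a quick illustration of schemes, here is a small sketch using the WHATWG URL API (available in modern browsers and Node.js); the example values are the ones used above.

    // Parse an http URI: the scheme and the location are both visible.
    const httpUri = new URL('http://www.google.com');
    console.log(httpUri.protocol); // 'http:'  (the URI scheme)
    console.log(httpUri.hostname); // 'www.google.com'  (the location)

    // Parse a urn URI: the scheme is 'urn', and the rest is an opaque name.
    const urnUri = new URL('urn:isbn:1-23-456789-0');
    console.log(urnUri.protocol); // 'urn:'  (the urn subspace)
    console.log(urnUri.pathname); // 'isbn:1-23-456789-0'  (namespace + name, no location)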

References:
http://www.w3.org/TR/uri-clarification/


Objects, Functions and Prototypes in JavaScript

JavaScript (JS) objects are fascinating and can easily confuse newcomers. This post details the JavaScript object model. There are several terms that we will examine:
— the hidden property/member __proto__ (some implementations of JavaScript might not expose this property, or might expose it as read-only)
— the non-enumerable properties/members of some special objects: prototype and constructor
— the keywords: new and function

Every entity in JavaScript is an object (apart from the primitives undefined, null, boolean, string, number). All objects have the hidden property called “__proto__” (at least for FF and Chrome implementations of JS). This is typically a read-only property, and points to the object from which the given object was constructed. This notion will become clear once we see what the keyword new does; for now, it suffices to know that every object has this property.

A special class of objects in JS is functions. In addition to the __proto__ property, function objects have a hidden “prototype” property. Note that in JS, any new object can only be created as an instance of a function (even when you do “var a = new Object;”, Object is a function btw), and this new object holds a reference to the function’s prototype object through its __proto__ property. Note that Function.prototype is the only function object that does not have a prototype property (because Function.prototype is defined as a Function object in ECMAScript, instead of being defined as a regular object).

Yet another special class of objects in JS are prototypes. In addition to the __proto__ property, the prototype objects have a “constructor” property. This constructor property points to the function for which the current object is a prototype.

So, there are 3 kinds of objects in JS: functions, prototypes, and regular objects. Note that there are four special entities in JS: Object.prototype, Object, Function.prototype, and Function.
— Object.prototype is a prototype object: its __proto__ property is null, and its constructor points to Object.
— Object is a function object: its __proto__ property is Function.prototype, and its prototype points to Object.prototype.
— Function.prototype should have been a prototype object, but is instead defined by ECMAScript to be a function object: its __proto__ property is Object.prototype, its constructor points to Function, but it has no prototype property (this is the only function object that does not have a prototype property; this function always returns undefined).
— Function is a function object: its __proto__ property points to Function.prototype, and its prototype property also points to Function.prototype.
As a summary, Object.prototype and Function.prototype are the primitive prototype objects, and Object and Function are the primitive function objects. The figure below illustrates this pretty well.
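
These relationships can be checked directly in a browser console or Node.js; the following sketch prints true for each of the claims above (using Object.getPrototypeOf, the standard way to read __proto__):

    console.log(Object.getPrototypeOf(Object.prototype) === null);              // Object.prototype has no __proto__
    console.log(Object.prototype.constructor === Object);                       // its constructor is Object
    console.log(Object.getPrototypeOf(Object) === Function.prototype);          // Object is a function object
    console.log(Object.getPrototypeOf(Function.prototype) === Object.prototype);
    console.log(Function.prototype.constructor === Function);
    console.log(Function.prototype.hasOwnProperty('prototype') === false);      // no prototype property of its own
    console.log(Function.prototype() === undefined);                            // callable, always returns undefined
    console.log(Object.getPrototypeOf(Function) === Function.prototype);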

We have discussed the properties __proto__, prototype, and constructor. We will now discuss the keywords “new” and “function”.

Whenever an object is created with “new” in JS (say “a = new A (arguments)”), the following sequence of actions happens (a sketch in code follows the list):
— an empty object ({}) is created (let’s just refer to it by “a” like in the above statement)
— a.__proto__ = A.prototype
— A.call (a, arguments)
— return a
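
Here is a minimal sketch of those steps as a helper function (the name myNew is made up for illustration). Note that the built-in new additionally returns the constructor's return value when that value is an object; this sketch mirrors the list above and always returns the new object.

    function myNew(A, ...args) {
      const a = {};                          // 1. an empty object is created
      Object.setPrototypeOf(a, A.prototype); // 2. a.__proto__ = A.prototype
      A.call(a, ...args);                    // 3. A.call (a, arguments)
      return a;                              // 4. return a
    }

    // usage
    function Point(x, y) { this.x = x; this.y = y; }
    const p = myNew(Point, 1, 2);
    console.log(p instanceof Point, p.x, p.y); // true 1 2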

The “function” keyword can be used to create new function objects. You can think of “function f (arguments) {body}” as syntactic sugar for “f = new Function (arguments, body);” (with optimizations performed by the JS implementation in the former case). Whenever a function “f” is declared, the JS implementation creates the function object “f” and f.__proto__ points to Function.prototype (as discussed above in the explanation for the “new” keyword). In addition, the JS implementation creates a prototype object referenced by “f.prototype”. The JS implementation sets f.prototype.constructor to f, and f.prototype.__proto__ to Object.prototype. 
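
A quick check of that wiring, runnable in any JS console:

    function f() {}
    console.log(Object.getPrototypeOf(f) === Function.prototype);         // true: f.__proto__ is Function.prototype
    console.log(f.prototype.constructor === f);                           // true: f.prototype.constructor is f
    console.log(Object.getPrototypeOf(f.prototype) === Object.prototype); // true: f.prototype.__proto__ is Object.prototype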

References:
— http://www.mollypages.org/misc/js.mp
— https://developer.mozilla.org/en/JavaScript/Reference/Functions_and_function_scope
— http://stackoverflow.com/questions/2344493/what-is-the-default-prototype-for-custom-function-in-javascript
— http://javascriptweblog.wordpress.com/2010/06/07/understanding-javascript-prototypes/
— http://phrogz.net/js/classes/OOPinJS2.html
— http://pivotallabs.com/users/pjaros/blog/articles/1368-javascript-constructors-prototypes-and-the-new-keyword
— http://stackoverflow.com/questions/4859308/in-javascript-why-typeof-function-prototype-is-function-not-object-like-ot


Forming a Startup

This post is intended for entrepreneurs who wish to start a small business. An important step is to form a company (apart from writing business plans to verify feasibility, etc.). There are several choices that the entrepreneur faces, and this guide should help you navigate those choices (please note that this is general information, not legal advice). Without further ado, let’s get started.

There are three issues involved in starting a business: liability, taxation, and ease of operation. Liability relates to debt incurred on behalf of the business, tort claims, etc., and whether the owner is personally liable for those. Taxation is either pass-through (the owners are taxed on the business income; the company itself is not) or at the entity level. Ease of operation relates to regulatory compliance, book-keeping requirements, and so on.
There are typically four structures available to form a business (with differing guarantees on liability, taxation, ease of operation):

  • Sole proprietorship: This is the simplest way to commercialize your idea. You will typically have  to register with your city or county. The liability is entirely on the owner, and as such is not suitable for high risk businesses. Getting a business insurance policy should cover you for low risk businesses. You will pay self-employment tax on your income, and some minimal taxes to your city/county. A sole proprietorship is easy to operate. 
  • Limited Liability Company (LLC) or Limited Liability Partnership (LLP): This is usually better when multiple owners are involved. You will have to choose a name for the LLC, and register with the Secretary of State. You will also have to name a registered agent for service of process (a person who handles legal documents on your behalf). Liability protection is better with an LLC, as the LLC is treated as an entity different from the owner. Also, debt incurred by one of the owners does not affect the other owners. Note that you will have to keep personal and business accounts separate, otherwise the liability protection is lost (cf. piercing the corporate veil). It is also a good idea to get a business insurance policy. Taxation is pass-through, and is reported as personal income. An LLC is relatively easy to operate — you will have to file articles of organization with the Secretary of State, have an operating agreement for each owner’s interest in the business, and keep accounts of income/expenditure. At least in California, you are either an LLC or an LLP based on your profession; there is no choice here.
  • S-Corp: The steps involved are similar to forming an LLC. Liability and taxation are similar to an LLC. In terms of operation, it’s more complicated because of record keeping requirements etc. So, why would one form an S-corp as opposed to an LLC? Because profits from an S-corp are not subject to self-employment tax, as long as they are not distributed as dividends. 
  • C-Corp: Similar to S-corp and LLC with respect to liability. C-corp is treated as a separate taxable entity. Thus, if profits are distributed to owners, there is double taxation (once on the income for C-corp and then on the income to owners). So, why choose C-corp over an S-corp? S-corp has restrictions on who can be owners — they have to be US citizens/residents, and cannot be other corporations like VCs. Also, S-corps can only have one class of stock. 
Given the discussion above, you should be able to make a choice of business entity depending on your needs. If you are just starting out, I would recommend an LLC for its liability protection, pass-through taxation, and minimal record-keeping requirements. You can always convert it to a C-corp later (the legal fees can be high though). 
If you are in California, a frequently asked question is whether to form the entity in California or in Delaware. There are two issues involved: case law and costs. The case law in Delaware is more developed and is favorable to management in shareholder lawsuits (which means this factor is mostly irrelevant for startup owners, and more relevant for VCs). Delaware has lower filing costs and annual taxes than California. But note that if your operations are in California, you will still need to register with California and pay franchise taxes (a minimum of $800 annually). My recommendation is to go with California.
References:

H.264 Inter Coding

In this post, I will examine the inter-coding aspects of H.264. Inter-coding is designed to exploit temporal correlation across frames. For example, when encoding a background which remains the same through several frames, a macroblock which is part of the background needs to be encoded only once, as opposed to once per frame. You can read my earlier posts on video coding here:

All the video codecs from MPEG1-4 (including H.264) use “block-based motion compensation”. When an object moves in the video, the blocks associated with the object, while remaining the same, move to new locations in later frames. This movement can be identified by a vector (called motion vector), and instead of encoding the block itself, it is typically more efficient to encode the identity of the reference frame (also called anchor frame) in which the block occurs and the motion vector itself. The process by which the decoder uses this information (reference frame + motion vector) to render the block is called motion compensation. 
In MPEG1-4, P-frames could use one motion vector per macroblock (B-frames could use two motion vectors, one for the past reference frame, and one for the future). In H.264, multiple motion vectors corresponding to several reference frames can be used, and motion vectors can be computed all the way down to 4×4 blocks. 
A 16×16 luma macroblock in H.264 can be inter-predicted in several different ways — as 1 16×16 block, 2 16×8 blocks, 2 8×16 blocks, or 4 8×8 blocks (a P or B slice can also contain intra-coded macroblocks, but a given macroblock is either entirely intra- or entirely inter-coded). These are called partitions of a macroblock. Each 8×8 block can, in turn, be inter-predicted as 1 8×8 block, 2 8×4 blocks, 2 4×8 blocks, or 4 4×4 blocks. These are called sub-partitions (see picture below). This method of using partitions and sub-partitions for inter-prediction is called tree-structured motion compensation.

Note that if the inter-prediction for a particular macroblock works perfectly (i.e., the residual after quantization is all 0s, and the motion vector matches its predicted value, so nothing needs to be transmitted), the macroblock is marked as a “skip”. Otherwise, larger partition sizes lead to fewer bits for encoding the motion vectors and the reference frames, but could lead to more bits spent in encoding the residual (actual partition – motion-predicted partition) for partitions with high detail. On the other hand, choosing smaller partition sizes leads to more bits spent on encoding the motion vectors and reference frames, but potentially fewer bits on encoding the residual. There are several algorithms to find the optimal split (into partitions and sub-partitions), but those are beyond the scope of this video coding series.

Also, the chrominance components of a macroblock are split into partitions in similar ways. For example, for a YCbCr422 input, each chroma (Cb, Cr) component of a macroblock covers 8×16 samples (the chroma resolution is halved horizontally only). This block can be inter-predicted as 1 8×16 block or 2 8×8 blocks. Each 8×8 block can then be inter-predicted in several ways as described above.

For a given partition (or sub-partition), a similar-sized partition in a given reference frame is used for prediction. The offset of the partition in the reference frame with respect to the given partition is specified by a two-dimensional motion vector (see the picture below). In MPEG1-3, the resolution of the motion vectors was half a pixel (for example, (0.5, 1.5) would be a valid motion vector). For integer-valued motion vectors, the partition in the reference frame exists as real pixels. For fractional values like (0.5, 1.5), the partition in the reference frame has to be interpolated from the neighboring (integer-positioned) real pixels (as shown in picture ‘c’ below). In H.264, a 6-tap filter (a weighted linear combination of the 6 pixels on the same horizontal or vertical line) is used to compute the half-pixel values (note that for some positions, multiple possibilities arise; see here for more details).

In MPEG4-ASP (Advanced Simple Profile) and H.264, the resolution of motion vectors is a quarter pel (quarter of a pixel). Once the half-pixel values are computed, the quarter-pixel values are computed using bilinear interpolation (an equation of the form v = a + b * x + c * y + d * x * y, where x and y are hpel values, a, b, c, d are coefficients, and v is the computed qpel value; in H.264 this amounts to a rounded average of the two nearest integer/half-pixel samples; hpel refers to half-pixel and qpel to quarter-pixel).
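
As a concrete sketch (assuming 8-bit samples), here is roughly how a half-pel and a quarter-pel luma sample are computed; the 6-tap kernel (1, −5, 20, 20, −5, 1) with rounding and a right shift by 5 is the one the standard specifies for luma half-pel positions.

    // Half-pel value from 6 consecutive integer-position samples on one row or column.
    function halfPel(p0, p1, p2, p3, p4, p5) {
      const v = p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5;
      return Math.min(255, Math.max(0, (v + 16) >> 5)); // round, shift, clip to the 8-bit range
    }

    // Quarter-pel value: rounded average of the two nearest integer/half-pel samples.
    function quarterPel(a, b) {
      return (a + b + 1) >> 1;
    }

    // usage: interpolate on a row that steps from ~100 to ~120
    const h = halfPel(98, 99, 100, 120, 121, 122);
    console.log(h, quarterPel(100, h)); // 110 105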

Note that motion vectors themselves are not transmitted in full for all the partitions. They are predicted from the motion vectors of neighboring (sub-)partitions. In particular, the component-wise median of the motion vectors of the partitions immediately to the left, immediately above, and diagonally above and to the right is used as a prediction for the motion vector of the partition under consideration (note that this predictor is different when “uneven” partitions like 16×8 blocks are used, or when some of the motion vectors are unavailable). You can check the ffmpeg implementation here for more details.
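
A minimal sketch of that median prediction (ignoring the special cases for unavailable neighbours and for 16×8/8×16 partitions):

    // Component-wise median of the left (A), top (B), and top-right (C) neighbours' motion vectors.
    function medianOf3(a, b, c) {
      return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    function predictMv(mvA, mvB, mvC) {
      return {
        x: medianOf3(mvA.x, mvB.x, mvC.x),
        y: medianOf3(mvA.y, mvB.y, mvC.y),
      };
    }

    // usage: only the difference between the actual MV and this prediction is encoded
    const pred = predictMv({ x: 2, y: 0 }, { x: 3, y: 1 }, { x: 8, y: 1 });
    console.log(pred); // { x: 3, y: 1 }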

References:


H.264 Intra Coding

In this post, I will outline the algorithms used to encode macroblocks in intra-mode. Intra coding is designed to exploit spatial correlation in a given slice. For example, when encoding a slice which has uniform color throughout, there is a high correlation between the macroblocks of that slice (In this context, note that inter coding is designed to exploit temporal correlations (across multiple frames)). You can read my earlier articles on video coding here:

We will start with intra coding for the luma blocks of a macroblock. For each 16×16 luma block, a decision should be made whether to encode the block with 16×16 resolution, or as 16 4×4 blocks. The encoder runs the prediction algorithms for each of the two choices above, and computes the SAE (Sum of the Absolute Error between the actual 16×16 block and the predicted 16×16 block). The encoder will pick the choice with the least SAE. Note that the predicted blocks are obtained using neighbouring blocks, exploiting space correlation in the slice.

For encoding as a 16×16 block, there are 4 predicted 16×16 blocks possible. Of course, the predicted block resulting in the least SAE is used to represent this choice. For encoding as 16 4×4 blocks, there are 9 predicted blocks for each 4×4 block. Again, the predicted block that minimizes the SAE is chosen for each 4×4 block. Thus, for making the choice, the encoder needs to go through the following number of SAE computations: 4 predictions * 256 pixels for the one 16×16 block + 9 predictions * 16 pixels for each 4×4 block * 16 4×4 blocks = 3328. Several algorithms are proposed for reducing the complexity of making this choice; these are beyond the scope of my blog posts.

We will now examine the 4 predicted 16×16 blocks, which are illustrated in the picture below. The vertical mode has the immediate top row of pixels replicated across the block. The horizontal mode has the pixels on the left column replicated across the block. The DC mode fills each pixel with the mean of the pixels from the left column and the pixels from the top row. The plane mode uses a linear function of the top row and left column of pixels to generate values for the entire 16×16 block of pixels (this mode is good when the luma varies smoothly over the block).
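
A rough sketch of the vertical, horizontal, and DC predictors and the SAE-based choice between them (for an N×N block whose top row and left column of neighbouring pixels are already decoded; the plane mode, the 4×4-specific modes, and the availability rules are omitted):

    // block: N x N array of actual pixels; top: N pixels above it; left: N pixels to its left
    function predictVertical(top, N) {
      return Array.from({ length: N }, () => top.slice());           // copy the top row down
    }
    function predictHorizontal(left, N) {
      return left.map(v => Array(N).fill(v));                        // copy the left column across
    }
    function predictDC(top, left, N) {
      const mean = Math.round(top.concat(left).reduce((s, v) => s + v, 0) / (2 * N));
      return Array.from({ length: N }, () => Array(N).fill(mean));   // fill with the mean
    }
    function sae(block, pred) {                                      // Sum of Absolute Errors
      let s = 0;
      for (let y = 0; y < block.length; y++)
        for (let x = 0; x < block.length; x++) s += Math.abs(block[y][x] - pred[y][x]);
      return s;
    }
    function bestMode(block, top, left) {
      const N = block.length;
      const candidates = {
        vertical: predictVertical(top, N),
        horizontal: predictHorizontal(left, N),
        dc: predictDC(top, left, N),
      };
      let best = null;
      for (const [mode, pred] of Object.entries(candidates)) {
        const cost = sae(block, pred);
        if (!best || cost < best.cost) best = { mode, cost };
      }
      return best;                                                   // the encoder picks the least-SAE mode
    }

    // usage with a tiny 2x2 "block" for illustration
    console.log(bestMode([[10, 10], [12, 12]], [10, 10], [10, 12])); // { mode: 'horizontal', cost: 0 }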

The 9 modes for the 4×4 blocks are illustrated below. The vertical, horizontal, and DC modes are as described above (for the 16×16 blocks).

For modes 3-8, the value of a pixel is computed as a weighted average of the neighboring pixels. The pictures below indicate how the predicted values are computed for the diagonal down-left, diagonal down-right, vertical-left, vertical-right, horizontal-down, and horizontal-up modes.

Note that slices in H.264 are encoded independently (i.e., no intra predictions are used across slices), and hence pixels A-H, pixel M (or Q in the 4×4 pictures above), and pixels I-L might not always be available for prediction. If some pixels are not available, the modes which use those pixels are not considered; if the above-right pixels E-H are unavailable, they are replaced by the value of D for the modes that need them. Note that the DC mode is the only one which can be used when none of the neighboring pixels are available (in that case the predicted value is 128 for all the pixels, i.e., the mid-grey value for 8-bit samples).

For the typical YUV420 color space encoding, there are two 8×8 chroma blocks (Cb and Cr). The 8×8 intra prediction schemes follow the 4 16×16 luma modes described above. The same prediction mode is used for both the chroma blocks. Also, whenever any of the 4 8×8 luma blocks of a 16×16 macroblock are intra coded, the chroma blocks are intra coded as well.

For some anomalous content, or when the quantization parameter is very small, there is an I_PCM mode where the blocks are sent without any prediction/transformation/quantization. In such cases, sending the original pixels costs fewer bits than the regular process.

References:
— http://www.naun.org/journals/communications/c-02.pdf
— http://tech.ebu.ch/docs/techreview/trev_293-schaefer.pdf
— http://blog.naver.com/PostView.nhn?blogId=kazama10&logNo=50054794619
— http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch16/ch16_h_264.html
— H.264 and MPEG-4 video compression by Iain Richardson (Pg. 180)
— Intra Prediction by Yusuf Adibelli
— Data compression: the complete reference by David Salomon
— Detailed 4×4 luma prediction equations (Pg. 11, used in this blog post)
— http://guohanwei.51.net/motion/H.264.pdf


Overview of the H.264 codec pipeline

In this post, I will present an overview of the H.264 encoding/decoding pipeline — the algorithms and processing used to convert a source RGB input into H.264 video, and to decode a H.264 bitstream into RGB output. Here are links to my earlier posts on video coding:

The picture below illustrates the H.264 encoding/decoding pipeline. Other MPEG standards have a similar pipeline, but the higher compression in H.264 is due to the details. For example, in the earlier MPEG standards, blocks are encoded at a minimum resolution of 8×8, while with H.264 they can be encoded at minimum resolutions of 4×4. I will try to point out the differences between the standards where possible. 
A frame in H.264 can be split into multiple slices, each of which consists of several macroblocks. A macroblock is the fundamental unit of video amenable to coding. A macroblock typically represents a 16×16 region. For the YUV420 scheme, a 16×16 region has a total of 6 8×8 blocks — 4 8×8 luma blocks, 1 8×8 Cb block, and 1 8×8 Cr block.

Macroblocks are typically encoded in left-to-right, top-to-bottom order (also called scan order). Once a macroblock is chosen for encoding, a decision should be made whether the macroblock should be intra-predicted or inter-predicted (exactly how this decision is made will be dealt with in a later post). Intra-prediction is done for the entire macroblock (16×16 resolution), or for each of the constituent 16 4×4 blocks (4 8×8 blocks are used for chroma components). Inter-prediction is done at 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4 resolutions (this is called tree-structured motion compensation, and will be discussed in detail in a later post), and uses several reference macroblocks from the past (and the future in the case of B-blocks).  

Once the intra- or inter-prediction is made, the predicted macroblock is subtracted from the current macroblock to obtain what is called the residual macroblock (residual for short). This residual is then transformed and quantized using techniques similar to image compression (again, to be discussed in more detail in a later post). In the case of B-blocks, re-ordering is required so that the reference slices from the future are encoded and transmitted before the B-slices that depend on them (we will discuss how the decision to make a slice I-, P-, or B- is taken in a later post). After the re-ordering is done, all the macroblocks, along with the associated prediction and motion compensation information, are entropy-coded using context-adaptive algorithms (again, to be discussed in a later post). The compressed bitstream then gets passed to the Network Abstraction Layer (NAL) for storage or broadcast over a network.
After a macroblock is transformed and quantized (and before the re-ordering is done), it is inverse quantized and inverse transformed to mimic the behaviour of an H.264 decoder. Note that this decoded macroblock has lost information compared to the original macroblock because of the quantization step. This macroblock is then used as-is for the prediction of an I-block, or is subjected to filters (like the deblocking filter in the picture above) to obtain a reconstructed macroblock, which then acts as a reference for predicting P- and B-blocks. It is interesting to note that the decoder is effectively invoked during the encoding process, which points to the significant complexity of the encoder (in comparison with the decoder).
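
As a toy illustration of why the encoder runs its own decoder (with a simple scalar quantizer standing in for the real transform and quantization), note that both sides must predict from the same lossy reconstruction, not from the original pixels:

    // Toy scalar "transform + quantization": real H.264 uses an integer transform and
    // a quantization parameter, but the round trip below shows the same loss of information.
    const qp = 8;
    const quantize = v => Math.round(v / qp);
    const dequantize = q => q * qp;

    const prediction = 100;                         // predicted pixel (from a reference/neighbouring block)
    const original = 117;                           // actual pixel in the current macroblock
    const residual = original - prediction;         // what gets transformed, quantized, and entropy-coded

    const coded = quantize(residual);               // this is all that reaches the decoder
    const reconstructed = prediction + dequantize(coded);

    console.log(residual, coded, reconstructed);    // 17, 2, 116 -- lossy, but identical on both sides
    // The encoder must use `reconstructed` (116), not `original` (117), as the reference for
    // later predictions; otherwise it would drift away from what the decoder actually sees.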
References:

Video Coding Overview

In the past, we have looked at color spaces in parts 1 and 2, and image compression here. In this post, I will provide an introduction to video coding, and hopefully get into more detail in the later posts.

Consider a video at a resolution of 640×480@30fps (frames per sec). Thus for each second, there are 30 frames (which are images) each at a resolution of 640×480. Representing each of these images as (R, G, B) data would require 640 * 480 * 3 bytes per frame, assuming 1 byte per primary color. Thus, a second of video would take up 640 * 480 * 3 * 30 bytes; that’s about 100GB of data per hour. The goal then is to compress the video into a smaller size.
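
The arithmetic, for reference (a quick sketch; 1 GB taken as 10^9 bytes):

    const bytesPerFrame = 640 * 480 * 3;               // RGB, 1 byte per primary color
    const bytesPerSecond = bytesPerFrame * 30;         // 30 fps
    const gbPerHour = bytesPerSecond * 3600 / 1e9;
    console.log(bytesPerSecond, gbPerHour.toFixed(1)); // 27648000, ~99.5 GB per hour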

One straight-forward approach would be to compress each of the frames using JPEG techniques described earlier (this is called the MJPEG codec). JPEG typically gets a compression factor of 10; that would still be about 10GB of video data per hour. Using video coding techniques, we should be able to get to 2GB of data per hour.

Note that MJPEG’s poor performance is due to lack of interframe prediction (Let’s say the video content is a static image repeated over and over. MJPEG encodes each frame independently, and does not make use of the fact that the frames are all the same; it does not make use of inter-frame information). Video codecs like MPEG1, MPEG2, MPEG4, H.264 (which is MPEG4 Part 10) all make use of inter-frame prediction.

In MPEG1-4 (excluding H.264), a frame is the fundamental codec element. Each frame is encoded as one of the following:

  • I-frame (all the macroblocks are I-blocks, and do not refer to frames in the past)
  • P-frame (the macroblocks in this frame are either I-blocks or P-blocks). P here stands for Predicted frames, and P-blocks make reference to previously coded I- or P-frames in the past. I will go into the details of inter-frame prediction in a later post.
  • B-frame (the macroblocks in this frame are I-blocks, P-blocks, or B-blocks). B here stands for Bi-predicted frames, and B-blocks make reference to I- and P- frames in the past and in the future.  

H.264 is the most advanced of the afore-mentioned codecs. It typically provides 1.5 times better compression than MPEG4 Part 2, provides techniques for better error resilience over lossy networks, better parallelization of decoding, in-loop deblocking to smooth rough edges, amongst several other improvements. From now on, I will present most of the video coding techniques in the context of H.264.

In H264, a slice is the fundamental codec element (as opposed to the frame in MPEG1-4); an entire frame can be treated as one slice (one slice per frame), or a frame can be split into multiple slices (multi-slicing). Each slice consists of a set of macroblocks, and can be decoded independently of other slices. Multi-slicing provides error resilience over lossy networks, and also aids parallelization in decoding a frame. I will get to different slice allocations like Flexible Macroblock Ordering (FMO), Arbitrary Slice Ordering (ASO), and Redundant Slices (RS) in a later post.

Each slice, then, consists of a sequence of macroblocks. There are 5 types of slices possible:

  • I-slice consists only of I-blocks (which can be decoded without referring to other frames)
  • P-slices consist of I- and P- blocks
  • B slices consist of I-, P- and B- blocks
  • SI slices 
  • SP slices 

SI and SP slices consist of a special kind of I- and P-blocks respectively to facilitate switching between different bitrates of the same stream. We will look at SI and SP slices in a later post.

In the next series of posts, I will take a deeper look at various aspects of H.264 in the following order:

  • Overview of the H.264 codec pipeline 
  • Intra-frame prediction schemes
  • Inter-frame prediction schemes (will include a discussion of motion vectors)
  • In-loop deblocking filter
  • Slice allocation schemes
  • SI and SP slices
  • Entropy coding techniques (Context-Adaptive Variable Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC))
  • Network Abstraction Layer (NAL) units

References:
— Iain Richardson’s excellent articles on H.264 here.
— H.264 primer
— http://en.wikipedia.org/wiki/Motion_JPEG
— http://en.wikipedia.org/wiki/MPEG-1
— http://tech.ebu.ch/docs/techreview/trev_293-schaefer.pdf
— Good discussion of NAL Units and slices here.
— Nice paper on the H.264 standard.
— H.264 and MPEG-4 video compression by Iain Richardson (Pg. 164 for slices etc.)
— http://ip.hhi.de/imagecom_G1/assets/pdfs/csvt_overview_0305.pdf
— Overview of NAL
— http://en.wikipedia.org/wiki/MPEG-2
— http://en.wikipedia.org/wiki/MPEG-3
— http://en.wikipedia.org/wiki/MPEG-4
— http://www.kanecomputing.co.uk/pdfs/compression_ratio_rules_of_thumb.pdf


Image Compression Techniques

We have covered color spaces in parts 1 and 2 of this video coding series. In this installment, we will go over image compression in some detail. This should provide a good introduction to the basics of video coding.
An image is a rectangular array of pixels at a particular resolution (say 640 x 480). Each of these pixels is represented by R, G, B values (one byte for each color). Thus, the image size would be 640 * 480 * 3 bytes. The goal is to compress the image into a smaller size.
The steps involved in compressing an image are as follows:
  1. (Optional) Convert from RGB color space to YCbCr. Chroma sub-sampling can then be applied to reduce image size with little impact on human perception. For high fidelity images, this step can be skipped.
  2. Divide YCbCr components into 8×8 blocks. Apply DCT (described below in more detail) on these 8×8 blocks. DCT transforms the 8×8 spatial blocks into 8×8 frequency blocks, with the higher frequencies at the bottom right. The goal is to move into the frequency domain, and be able to sacrifice information at higher frequencies to achieve compression. The reason this works is that the human eye is much more sensitive to variations at lower frequencies than at higher frequencies. At this point, the conversion is still theoretically lossless. We will discuss some practical considerations below.
  3. Use a quantization matrix to scale down the 8×8 DCT values obtained in step 2 (the matrix contains the scaling factors used to divide each of the 8×8 DCT values). Note that the higher frequencies are typically scaled down by a larger value (given that the human eye is much less sensitive to those frequency ranges). The scaled values are then rounded to the nearest integer. Apart from the optional chroma sub-sampling in step 1, this is the only lossy step in the encoding process.
  4. Collect all the values in the quantized 8×8 frequency block into one string. Use a zig-zag path to order the values so that similar frequencies are grouped together. Run-length encoding (where the encoding is done as a tuple (value, number of repetitions of the value)) is employed in this phase. Certain JPEG standard codes are also used at this point, like EOB (End of Block; to signify that there are no further non-zero values).
  5. The whole string is then encoded using Huffman coding; other algorithms like arithmetic coding are also supported. (A small sketch of steps 3 and 4 in code follows this list.)
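
A rough sketch of steps 3 and 4, using a made-up quantization matrix rather than the JPEG standard tables, and a zig-zag order generated by walking the anti-diagonals:

    // Step 3: quantize an 8x8 DCT block with a quantization matrix (element-wise divide + round).
    // The matrix here is a toy one that simply grows toward the high frequencies.
    const N = 8;
    const quantMatrix = Array.from({ length: N }, (_, y) =>
      Array.from({ length: N }, (_, x) => 8 + 4 * (x + y)));

    function quantizeBlock(dct) {
      return dct.map((row, y) => row.map((v, x) => Math.round(v / quantMatrix[y][x])));
    }

    // Step 4: zig-zag order (low to high frequency).
    function zigzagOrder(n) {
      const order = [];
      for (let s = 0; s <= 2 * (n - 1); s++) {
        const diag = [];
        for (let y = 0; y < n; y++) {
          const x = s - y;
          if (x >= 0 && x < n) diag.push([y, x]);
        }
        if (s % 2 === 0) diag.reverse();          // alternate direction on each anti-diagonal
        order.push(...diag);
      }
      return order;
    }

    // Simple run-length encoding of the zig-zag scan as (value, runLength) pairs.
    function runLengthEncode(values) {
      const out = [];
      for (const v of values) {
        const last = out[out.length - 1];
        if (last && last.value === v) last.run++;
        else out.push({ value: v, run: 1 });
      }
      return out;
    }

    // usage: a block whose energy is concentrated at low frequencies quantizes to mostly zeros
    const dct = Array.from({ length: N }, (_, y) =>
      Array.from({ length: N }, (_, x) => (x + y < 3 ? 80 - 20 * (x + y) : 0)));
    const scan = zigzagOrder(N).map(([y, x]) => quantizeBlock(dct)[y][x]);
    console.log(runLengthEncode(scan)); // a few non-zero low-frequency values, then one long run of zeros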

Additional notes:

  • The YCbCr values for JPEG images are in the range [0, 255] (and not [16, 235-240], because the images are only handled digitally, and typically not sent to an analog TV). The midpoint of the range (128 in this case) is subtracted from these values to center them around 0. DCT is then applied to these shifted values.
  • The DCT transformation (akin to a Fourier transform without the sines) is usually of the form:
    G(u, v) = (1/4) * α(u) * α(v) * Σ(x=0..7) Σ(y=0..7) g(x, y) * cos[(2x + 1) * u * π / 16] * cos[(2y + 1) * v * π / 16]
  • x and y are the (horizontal and vertical respectively) co-ordinates in the 8×8 YCbCr block. u and v are the (horizontal and vertical respectively) co-ordinates in the 8×8 DCT block. The α function is a normalizing factor (1/√2 for index 0, and 1 otherwise). Each value G(u, v) is obtained by summing up g(x, y) * basis_function(u, v)(x, y). The values of the basis functions make a pretty picture in grayscale:

  • A typical quantization matrix looks thus (note the smaller values at top-left corresponding to lower frequencies, and larger values at bottom-right corresponding to higher frequencies):
  • The zig-zag path (described earlier) to order DCT values in frequency order (lowest to highest) looks thus:

References:
— http://en.wikipedia.org/wiki/JPEG
— http://en.wikipedia.org/wiki/Discrete_cosine_transform
— http://en.wikipedia.org/wiki/Fourier_transform


    RGB, YUV, and Color Spaces Part 2

    In the last post, I touched upon the basics of YUV encoding. In this post, we will take a detailed look at YUV coding and some variants like YIQ, YPbPr (and its digital equivalent YCbCr). 
    Given the (R, G, B) values of a pixel, the YUV values are computed as weighted sums as follows:
    • WR = 0.299, WG = 0.587, WB = 0.114 are the weights for R, G, B respectively
    • Umax = 0.436, Vmax = 0.615
    • Y = WR * R + WG * G + WB * B
    • U = Umax * (B – Y) / (1 – WB)
    • V = Vmax * (R – Y) / (1 – WR)

    The RGB values can be computed from the YUV values as follows:

    • R = Y + V * (1 – WR) / Vmax
    • G = Y – U * WB * (1 – WB) / (Umax * WG) – V * WR * (1 – WR) / (Vmax * WG)
    • B = Y + U * (1 – WB) / Umax

    These transformations can be represented as matrix operations as follows:
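
    A small numeric sketch of the forward and inverse transforms above (RGB values normalized to [0, 1]):

        const WR = 0.299, WG = 0.587, WB = 0.114;
        const Umax = 0.436, Vmax = 0.615;

        function rgbToYuv({ r, g, b }) {
          const y = WR * r + WG * g + WB * b;
          return { y, u: Umax * (b - y) / (1 - WB), v: Vmax * (r - y) / (1 - WR) };
        }

        function yuvToRgb({ y, u, v }) {
          const r = y + v * (1 - WR) / Vmax;
          const b = y + u * (1 - WB) / Umax;
          const g = (y - WR * r - WB * b) / WG;   // equivalent to the G formula above
          return { r, g, b };
        }

        // round trip: an orange-ish pixel
        const yuv = rgbToYuv({ r: 1.0, g: 0.5, b: 0.0 });
        console.log(yuv, yuvToRgb(yuv));          // recovers r ≈ 1.0, g ≈ 0.5, b ≈ 0.0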

    Note that YUV encoding is used in PAL (Phase Alternating Line) and SECAM (Sequentiel Couleur a memoire, Sequential Color with Memory) analog TV encoding systems. PAL is used throughout most of Europe, Africa, Asia, Australia etc., while SECAM is used in France, Russia etc. The NTSC (National Television System Committee) standard used in North America uses a slightly different encoding called YIQ (I stands for In-phase, and Q stands for Quadrature; I and Q are different axes, rotated 33° from the UV axes on the same plane). The Y calculation is the same, but I and Q are computed as follows:
    While YUV and YIQ are used in the context of composite video, YPbPr is used in the context of component video (YCbCr is the digital counterpart to YPbPr, and is obtained after applying scaling and offsets to YPbPr). YPbPr is obtained from RGB values as follows:

    The values of KB and KR above are determined based on the RGB color space used (this is dependent on the TV circuitry rendering the colors). Note that RGB values are in the range [0, 1], Y is in the range [0, 1], and Pb, Pr are in the range [-0.5, 0.5]. For example, Standard Definition (SD) TV uses KB = 0.114 and KR = 0.299 (ITU-R BT.601 Standard), while High Definition (HD) TV uses KB = 0.0722 and KR = 0.2126 (ITU-R BT.709 Standard).

    YCbCr is obtained from YPbPr as follows (Y is in the range [16, 235], CbCr are in the range [16, 240]):
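
    A sketch of those two steps in code, using the commonly cited BT.601 constants and the [16, 235] / [16, 240] scalings mentioned above (the KR and KB defaults are assumptions you can swap for the BT.709 values):

        // RGB (in [0, 1]) -> YPbPr, parameterized by KR and KB (BT.601 values shown).
        function rgbToYPbPr({ r, g, b }, KR = 0.299, KB = 0.114) {
          const y = KR * r + (1 - KR - KB) * g + KB * b;
          return { y, pb: 0.5 * (b - y) / (1 - KB), pr: 0.5 * (r - y) / (1 - KR) };
        }

        // YPbPr -> 8-bit YCbCr: Y scaled into [16, 235], Cb/Cr into [16, 240] around 128.
        function yPbPrToYCbCr({ y, pb, pr }) {
          return {
            y: Math.round(16 + 219 * y),
            cb: Math.round(128 + 224 * pb),
            cr: Math.round(128 + 224 * pr),
          };
        }

        // usage: white maps to (235, 128, 128), black to (16, 128, 128)
        console.log(yPbPrToYCbCr(rgbToYPbPr({ r: 1, g: 1, b: 1 })));
        console.log(yPbPrToYCbCr(rgbToYPbPr({ r: 0, g: 0, b: 0 })));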

    So, why are the values scaled and offsets added to obtain YCbCr? Typically, the YCbCr encoding is used in DVDs, and needs to be converted into an analog signal to be fed into a TV. This signal processing can push values beyond the levels acceptable to a TV, resulting in clipping artefacts; the head-room and foot-room left by the reduced ranges absorb such excursions. On computers, where analog signal processing is not an issue, YCbCr uses the full [0, 255] range, for example in the JPEG/JFIF standard.

    References:

    RGB, YUV, and Color Spaces

    This post will be the first in what will hopefully be a series on video coding. A raw video is a series of frames at a particular resolution. For example, a frame at 640×480 resolution will have 307200 pixels. Each pixel can be represented by a three-tuple (R, G, B) corresponding to each of the primary colors. A typical scheme is to use 1 byte (256 values) for each color, so a pixel can be represented with 3 bytes [1]. Thus, a 640×480 resolution video at 30fps (frames per sec) would result in 640 * 480 * 3 * 30 bytes of data per second. Several encoding techniques are used to capture the elements important for human perception (anything not important for human perception could be discarded to reduce the amount of data for broadcasting to TV, recording to DVD etc.) This is where YUV, YCbCr etc. come into play, and will be discussed below.

    Now for some historical issues. TVs initially were only black and white, and hence only the gray values were transmitted in the broadcast signal (analog at the time). Note that gray values are nothing but equal parts of the (R, G, B) colors. Thus, black is (0, 0, 0), white is (255, 255, 255), and there are 254 gray values in between. This 1-byte representation of gray values is called luminance, and the symbol Y or Y’ is typically used to represent it (we will use Y in this post [2]). Note that ignoring color and using just the Y values results in 3 times less data than the original.

    Then, color TVs burst onto the scene. One needed a TV signal that would work on both the older black and white TVs and the newer color TVs. The color TVs were designed to use the same circuitry as the older black and white TVs for the luminance part (Y), and had additional circuitry for rendering colors. The color information is called chrominance, and has two components, U and V. U is computed as the difference between the blue color and Y, and V is computed as the difference between the red color and Y (roughly speaking; the exact calculations will be given in a later post). This U and V information is then added to the main Y signal using a sub-carrier (a sub-carrier is a signal that is grafted onto a main signal, and carries additional data, in this case the color information). The color TV would then receive the YUV signal, compute the original R, G, B colors, and render them on the screen. Note that at this point, we are back to 3 bytes per pixel again.

    Given that the bandwidth of the TV signal is limited, there is a need to reduce the number of bits per pixel. Also, the quality of picture on the initial color TVs was constrained more by monitor technology than by the bandwidth of the TV signal, so fidelity to the original signal was not an emphasis. Given that the human eye is typically more sensitive to luminance than chrominance, the 1 byte of luminance per pixel is not sacrificed any further. The chrominance components are sub-sampled to reduce the number of bits per pixel (to reduce the bandwidth of the TV signal). This is done in two ways [3]. First, only the chrominance components for alternate scan lines are retained (this already cuts down the amount of UV data by half). Second, for every group of 4 pixels, only 2 UV values are retained instead of 4 UV values (this cuts down the UV data by half again). With these two techniques, the amount of UV data that gets transmitted is one-fourth the original amount.

    [1] Alpha: A fourth byte is sometimes added to represent the alpha value of a pixel. The alpha value indicates the transparency of a pixel. This is useful, for example, to overlay transparent logos on top of video.
    [2] Luminance/Luma: Luminance is the relative intensity of a pixel with respect to a reference white color. It’s a theoretical term, and is typically represented using the symbol Y. In video engineering, because of the non-linearity of the intensity of the electron gun with respect to the applied voltage in the original cathode ray tube, gamma encoding of the RGB values into R’G’B’ values is typically done (to nullify the non-linearity). The weighted sum of the R’G’B’ values is called luma, and is typically represented by the symbol Y’.
    [3] Chroma Subsampling: The encoding described above is called YUV420. There are other encodings like 411 (UV data cut to one-fourth) and 422 (UV data cut to one-half). The notation J:a:b describes a sampling region J pixels wide and 2 pixels high: ‘a’ is the number of U (and of V) samples in the first row, and ‘b’ is the number of U (and of V) samples in the second row. Thus, J = 4, a = 2, b = 0 represents the YUV420 encoding. (The arithmetic is sketched in code below.)
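
    A quick sketch of the storage arithmetic for a 640×480 frame under these schemes; with the J:a:b notation above, a subsampled YUV frame needs 1 + (a + b) / J bytes per pixel (1 byte of Y, plus (a + b) U samples and (a + b) V samples per J×2 region):

        function bytesPerPixel(J, a, b) {
          return 1 + (a + b) / J;   // 1 byte of Y per pixel, plus the subsampled U and V
        }

        const pixels = 640 * 480;
        console.log('RGB   :', pixels * 3);                      // 921600 bytes
        console.log('YUV444:', pixels * bytesPerPixel(4, 4, 4)); // 921600 bytes (no subsampling)
        console.log('YUV422:', pixels * bytesPerPixel(4, 2, 2)); // 614400 bytes
        console.log('YUV420:', pixels * bytesPerPixel(4, 2, 0)); // 460800 bytes (one-fourth the UV data)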

    References:
    — http://en.wikipedia.org/wiki/YUV
    — http://en.wikipedia.org/wiki/Chroma_subsampling
    — http://en.wikipedia.org/wiki/Luminance_(colorimetry)
    — http://en.wikipedia.org/wiki/Luma_(video)
    — http://en.wikipedia.org/wiki/Chrominance
    — http://en.wikipedia.org/wiki/Hue
    — http://en.wikipedia.org/wiki/Saturation_(color_theory)
    — http://en.wikipedia.org/wiki/Subcarrier
    — http://en.wikipedia.org/wiki/Gamma_correction
