Video Encoding 101 - Under the hood of the Codec (Part 1)

The series so far:
Introduction - http://blog.brightcove.com/en/2009/10/video-encoding-101-series
The Beginning - http://blog.brightcove.com/en/2009/10/video-encoding-101-beginning
The Problem Number 1 - http://blog.brightcove.com/en/2009/10/video-encoding-101-problem-number-1

In general a codec has one task: take a data feed in a specific encoding (be it analogue or digital) and run it through an algorithm that compresses the feed in such a way that the message is not lost during storage and decompression.

In fact that's where the word codec comes from: COmpression DECompression.

Implied between these two transformation states is a new Encoding Specification for storage - a language to convert the data into and store it in.

To think of this in modern terms, compare an email to a text message.

The email may read:

Hi Cam

Sorry for not getting this to you on Friday, I was waiting for confirmation of PD rights from our legal team

So please can you take a look through this overview, it's basically draft 1 so please let us know your feedback

Also there's a chart for questions to ask the developers and risks and issues so please feel free to add to them

Regards
Ivan

If I were then to text this, I might come up with:

Hi m8
Soz 4 not getin DIS 2 U on fri, I wz w8N 4 conf of PD rghtz frm our legal team

So pls cn U tAk a L%k Thru DIS OvrVew, itz basically draft 1 so pls lt us knO yor fEdbak. Also ther's a chart 4 :-Qz 2 ask d devLpRz & rskz & isUz so pls fEl frE 2 + 2 dem

Thanks to the Transl8it! web service for helping me out here.

So we took a data stream that was 362 characters in length and compressed it to a stream of 282 characters - that is, we compressed the original by some 22% whilst still maintaining the overall quality to ensure complete transmission of the message contained within the stream.  In this case our brains acted as the CODEC, and the Encoding Specification was the English language and its corresponding character set.
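
For the numerically inclined, the saving works out like this (a trivial sketch; the character counts are the ones from the example above):

```python
original, compressed = 362, 282  # character counts from the email and text above

saving = (original - compressed) / original
print(f"compressed by {saving:.0%}")  # prints: compressed by 22%
```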

Now take this and apply it to video data streams that are 100x the size of text, and still try to achieve 22% compression with no loss to the message within the data - enter the video CODEC.

Before getting into the methods deployed by a video CODEC, we need to look at the ability to sustain loss of integrity in the data stream whilst still preserving the message quality.  There is a critical threshold of loss that a medium (text, image, audio, video) can sustain during transformations like compression and decompression, and staying within it spells the difference between the message being understood or not by the receiver.

Due to the nature of each medium - that is to say, how many messages (carried as substreams) can be injected per data stream - the threshold rises drastically from text to video.  Text has a very low threshold before the message becomes indecipherable because it only allows one stream (no substreams) per transmission.  Video and audio, on the other hand, have much higher thresholds because they are able to use many substreams to make up the overall data stream - an auditory scene is comprised of many different frequencies, and a visual scene likewise has many different objects in the field of view, each its own substream.   These substreams combine to form the overall data stream and can suffer the loss or corruption of a few before the whole piece falls apart.

And in fact it's the identification and removal or manipulation of these substreams that allows modern-day codecs like H.264 and MP3 to achieve compression ratios much higher than the 22% we achieved with the text.

These codecs are called Lossy because, by taking into account the higher loss threshold of the medium, they perform much heavier manipulation of the individual substreams at the cost of introducing loss through error, knowing that as long as they stay within tolerance the message quality won't suffer irrecoverably.

There are indeed Lossless CODECs that swap the balance and favour quality of the substreams over compression ability.  FLAC for Audio and Lagarith for Video are but two - you can read more here: http://en.wikipedia.org/wiki/List_of_codecs

Looking at my text example above, a Lossless CODEC would be akin to the text messaging there: removing spaces and substituting characters that mean the same thing but in a simpler format.   A Lossy CODEC would do further work, looking at entire sentences, removing whole words, and coming up with clever ways to rebuild them during decompression.
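
As a toy illustration of that distinction (the substitution table and the list of droppable words here are invented for the example):

```python
# Lossless: every substitution is reversible, so decompression can
# recover the original text exactly.
SUBSTITUTIONS = {"for": "4", "to": "2", "you": "U", "please": "pls"}

def lossless_compress(text):
    return " ".join(SUBSTITUTIONS.get(word, word) for word in text.split())

# Lossy: whole words are discarded outright; the receiver has to
# rebuild the meaning without them.
DROPPABLE = {"so", "basically", "also"}

def lossy_compress(text):
    return " ".join(w for w in text.split() if w not in DROPPABLE)

message = "so please take a look through this it's basically draft 1"
print(lossless_compress(message))  # so pls take a look through this it's basically draft 1
print(lossy_compress(message))     # please take a look through this it's draft 1
```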

In today's online video world we suffer from Problem Number 1, and so Lossy CODECs are what we need to make sure we get our data down the limited pipe.  No doubt as internet bandwidth increases globally we'll see a time when Lossless CODECs come to the forefront and posts like these go the way of the dodo, but we're not there yet and I'm still in a job.

So how does a video codec like H.264 manipulate the substreams to get a high rate of compression?  It's a process called frame-residue storage. 

Now please note that from here on I'm not going to get overly technical.  Partly because it's outside the scope of this series, and partly because even I struggle to comprehend the real science of the matter!

So, high level, we need to look at the video data stream in 4D - as a data stream that has a time axis to it.  Each slice of time in an uncompressed video data stream is a 3D image (a Frame) of the scene.   Just like the flip books you drew in as kids, if you progress (flip) along the timeline fast enough (the Frames Per Second), you get seamless movement and in turn video.
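
As a rough mental model, here's that 4D view in NumPy (the resolution and duration are just assumptions for the example):

```python
import numpy as np

fps, seconds = 25, 2
height, width, channels = 720, 1280, 3  # an assumed frame size

# One full frame per slice of time, like the pages of a flip book.
video = np.zeros((fps * seconds, height, width, channels), dtype=np.uint8)

print(video.shape)   # (50, 720, 1280, 3)
print(video.nbytes)  # 138240000 - roughly 138 MB for two seconds, uncompressed
```

Which also shows why uncompressed video can't survive Problem Number 1 - two seconds of it already weighs in at well over 100 MB.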

Now, very simplistically, what H.264 does as a codec is look at video as a flip book with all its individual frames and select what we call Key Frames or I-Frames.   These are full-featured frames, as if all substreams pulsed at once to give the complete picture.   These Key Frames are then spaced more or less evenly apart (typically around 1 second).

We now have a heartbeat pulsing once a second through the video stream - not enough data to give smooth motion, considering it takes about 25 Frames Per Second for us to perceive seamless motion, but we can start getting the gist of the overall message across the stuttering data stream.
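
A minimal sketch of that heartbeat (the interval follows the roughly-one-second spacing mentioned above; the function name is mine):

```python
fps = 25
keyframe_interval = fps  # roughly one Key Frame (I-Frame) per second

def is_keyframe(frame_index):
    # Full-featured frames land on the interval boundaries; everything
    # in between will be filled in from stored differences.
    return frame_index % keyframe_interval == 0

print([i for i in range(100) if is_keyframe(i)])  # [0, 25, 50, 75]
```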

The next step is then to fill in the gaps.   What H.264 does is look at the difference, or delta, of the current frame in relation to its neighbouring Key Frames.  This difference is referred to as the residue - basically, what has moved in the field of view since the last pulse.  By focusing on what has moved we can identify what has not moved, which turns out to be quite a lot when we talk about very small time slices of video.   The unchanged parts can then be discarded, and we have a compressed stream of key frames and change information.
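
To make that concrete, here's a toy residue computation in NumPy. This is nowhere near the real H.264 pipeline (which predicts per block and transform-codes the residual); the threshold is an invented stand-in for "nothing moved here":

```python
import numpy as np

def residue(frame, keyframe, threshold=8):
    # The residue is what changed relative to the neighbouring Key Frame.
    delta = frame.astype(np.int16) - keyframe.astype(np.int16)
    # Tiny differences are treated as "nothing moved" and discarded -
    # this is where the lossy saving comes from.
    delta[np.abs(delta) < threshold] = 0
    return delta

def reconstruct(keyframe, delta):
    # Decompression rebuilds the frame by applying the stored changes.
    return np.clip(keyframe.astype(np.int16) + delta, 0, 255).astype(np.uint8)
```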

So what gets stored in the H.264 file (and most other lossy video CODECs) is the key frames and the changes in the substreams between them.  This allows for very high compression, but the story doesn't end there, because other elements come into play, such as bitrate, frame size, subpixel size, P and B frames, and much more.

That last list is really a set of ways of tweaking the residue that is detected, to further refine how much data can be discarded.

In the next part I'll look at these refinements in more detail.