Video Encoding 101 - Problem Number 1

The series so far:

Introduction - http://blog.brightcove.com/en/2009/10/video-encoding-101-series
The Beginning - http://blog.brightcove.com/en/2009/10/video-encoding-101-beginning

So... the $64,000 question: what is video, and how do we deliver it over the web?  In its most basic sense, video is just a consecutive series of static images presented fast enough that our brains process the incoming feed as a stream of continuous scenic data.

Going further in depth into the structure of video is out of scope for this series, but here are some lovely links if you want to get into the real nitty gritty of video and its science:

The key measurement of video is quality.   As we discussed previously, there are 4 fundamental mediums for delivering information over the web: Text, Images, Audio and Video.   Moving from left to right, each delivers contextual data more efficiently, but at the cost of an exponential increase in the transport data required.  So video, sitting at the far right of that spectrum, needs a good (or higher) degree of quality, because the resolution of a data stream is what our brains use to match the incoming signal against our memory banks.   If we can't distinguish an object then we fail to convert (transcode) the video stream into our thought streams.

What level of quality a specific stream needs is very subjective and largely dependent on the focus of the message (factual/news content typically tolerates lower quality vs. engaging nature films that demand high quality).  It can also be raised (or lowered) by the overall quality of any accompanying audio stream.  There are certain rules of thumb about how low you can go, and we'll get into those in later instalments.

The more contextual information you deliver, the more inference the consumer can do in terms of deriving unique and targeted messaging.  This is why most of us prefer watching the news to listening to a bulletin on the radio.   In the same attention span the provider can pack in so much more contextual data for the user to sort through and focus on what is important to them.   It makes it much more personal.

Video is broadband messaging.

As discussed in the last segment, video is a massive data stream.   Depending on who you believe, our eyes can process anywhere between 100Mbps and 200Mbps.   We also see at a resolution of around 324 megapixels, with our brains sampling this much data in a fraction of a second (although our focus is a subset of this field of vision).

In comparison, the average broadband user in the UK has a pipe about 8Mbps wide with a sustained throughput of around 25% of that pipe, or 2Mbps for all their internet traffic.  In other words, a factor 100x difference between what we see in real life and what can actually be presented to us.
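
To make that concrete, here's the arithmetic spelled out as a quick Python sanity check (the 200Mbps figure is the top end of the eye-bandwidth range quoted above):

    pipe_mbps = 8                       # average UK broadband pipe
    sustained_mbps = pipe_mbps * 0.25   # ~25% sustained throughput
    eye_mbps = 200                      # upper estimate of what our eyes process

    print(sustained_mbps)               # 2.0 Mbps for all internet traffic
    print(eye_mbps / sustained_mbps)    # 100.0 -> the ~100x gap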

So there we have it - our Problem Number 1 broken into its parts:

  1. Video as a format and medium can deliver upwards of 200Mbps of data to our eyes and brains (see the quick calculation after this list).
  2. Using the internet for part of that transmission sees the delivery pipe shrink to an average of 2Mbps (and smaller still for mobile device delivery)
  3. And we need to maintain a certain level of quality so that the message arrives with enough data for our brains to transcode it properly
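
As a back-of-the-envelope illustration of point 1, take an assumed (hypothetical) raw digital stream: 720p resolution, 24-bit colour, 25 frames per second.  Even that modest stream dwarfs both the 200Mbps figure and the 2Mbps pipe:

    width, height = 1280, 720      # assumed 720p frame size
    bits_per_pixel = 24            # assumed 24-bit colour
    fps = 25                       # assumed PAL-style frame rate

    raw_bps = width * height * bits_per_pixel * fps
    print(raw_bps / 1_000_000)     # ~553 Mbps of raw pixel data
    print(raw_bps / 2_000_000)     # ~276x wider than the 2Mbps pipe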

All this fuss is really to tackle point 2.  If you're using video in the online world then this is your issue, and probably why you're here reading this blog series.   Until the world gets fibre optic cables to every door, we need to figure out how to fit the round peg (video) into the square hole (an internet originally designed for the delivery of text).

And the weapon of choice?  Compression.   Taking something very large, compressing it into something very small, then decompressing it at a later date and time so that the original message/stream is preserved with a high degree of accuracy.
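
As a minimal sketch of that round trip, here's lossless compression using Python's standard zlib module.  (Video codecs are typically lossy, trading a little accuracy for far better ratios, but the overall shape is the same: compress, store or transmit, decompress.)

    import zlib

    # A very repetitive message compresses extremely well, just like
    # video frames that barely change from one to the next.
    message = b"the same frame, over and over " * 1000

    compressed = zlib.compress(message)
    print(len(message), "->", len(compressed))   # 30000 -> roughly 150 bytes

    # Decompress later and the original is preserved exactly.
    assert zlib.decompress(compressed) == message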

There are 4 topical areas in the process of Video Compression:

  1. The ENCODING SPECIFICATION: all the scientific mumbo jumbo that explains how data is to be held and described (think of it as a language and its associated rules for creating words, sentences, grammar, syntax etc)
  2. The CODEC: the code and process that wraps around all the math and science of the ENCODING SPECIFICATION to COmpress and transform the data using the framework specified by the encoding format and DECompress it back to an uncompressed state (hence CODEC).
  3. The ENCODER: a software app that wraps around 1 or more CODECs and exposes the properties of each (see the sketch after this list).
  4. The DECODER: the software app (player) that decompresses the video and presents it back to the user.
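
To see how those pieces fit together in practice, here's a sketch driving the ffmpeg command-line tool (an ENCODER) from Python, asking it to use libx264 (an open-source CODEC implementing the H.264 ENCODING SPECIFICATION).  The filenames are placeholders for this sketch; whatever player eventually opens the result plays the DECODER role.

    import subprocess

    # ffmpeg (the ENCODER) wraps many CODECs and exposes their
    # properties as command-line options.  "input.mov" and
    # "output.mp4" are placeholder filenames.
    subprocess.run(
        [
            "ffmpeg",
            "-i", "input.mov",    # the source video
            "-c:v", "libx264",    # pick the CODEC: an H.264 implementation
            "-crf", "23",         # one exposed codec property: constant quality
            "output.mp4",         # compressed result, ready for a DECODER (player)
        ],
        check=True,
    )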

I'm going to spend most of my time on the first 2 areas, and in particular on the current industry standard for online video delivery - H.264

H.264 is a specification that allows video data (either compressed or not) to be examined in a way that allows efficiency savings (typically of a very high degree) to be made, packing it into something much smaller than its original form.

H.264 is by no means the only specification out there: we have Windows Media VC1, ON2 and XVID to name but 3.  I'm focusing on H.264 mainly because it's the closest thing to a standard we have right now for online video, with both the ISO and the ITU setting it as their preferred codec.   With those 2 on board you can't get any more "standard" these days.

In the end, though, all mainstream codecs do pretty much the same thing.   Treating video as a continuous set of full-framed images, they have figured out a way of removing many of those images and replacing them with vector representations of their changes over time.

It just so happens that the folks behind H.264 (MPEG) have figured out how to do this realllllly well, got some serious industry bodies to back them, and haven't tied themselves to any major video platform/device (yes, even Microsoft has signed up for it!).

How H.264 and other codecs do this compression is the secret sauce that distinguishes them.   But in the end it all boils down to identifying Key Frames: effectively full-framed images in the video time stream that will be used as the start and end points of any vector analysis.

Once those are identified, in the most basic terms the specifications offer ways of describing the changes between the 2 end point Key Frames, and the resulting vector information is much smaller in data size than the frames it replaces.   Just as blueprints that describe how to build something are much smaller than the actual built object, the compression part of the specification is about how to create blueprints for the series of images between 2 key frames.
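
Here's a toy sketch of that blueprint idea, with big simplifying assumptions: real codecs describe changes with motion-compensated vectors and much cleverer math, whereas this just stores a key frame plus raw pixel differences.  The principle of "full frame at the ends, small descriptions of change in between" is the same, though.

    import numpy as np

    def encode(frames):
        """Keep the first frame whole (the key frame) and store only
        the pixel-by-pixel changes for every frame after it."""
        key = frames[0]
        deltas = [frames[i] - frames[i - 1] for i in range(1, len(frames))]
        return key, deltas

    def decode(key, deltas):
        """Rebuild the original frames by replaying each set of
        changes on top of the previously reconstructed frame."""
        frames = [key]
        for d in deltas:
            frames.append(frames[-1] + d)
        return frames

    # A sequence where only one pixel changes per frame: the deltas
    # are almost entirely zeros, which is what makes them so small.
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, (4, 4), dtype=np.int16)]
    for _ in range(5):
        nxt = frames[-1].copy()
        nxt[0, 0] += 1
        frames.append(nxt)

    key, deltas = encode(frames)
    rebuilt = decode(key, deltas)
    assert all((a == b).all() for a, b in zip(frames, rebuilt))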

All that's needed at a later point in time is someone who understands how to read those blueprints and can rebuild the object as it was described.

But nothing is perfect, and this blueprint creation is not (yet) flawless.   In the next post I'll look in more detail at how this vectoring is done and where things can and do go wrong.   From there we can look at strategies to correct them.   In the end we'll be seeking great quality, and in turn solving our Problem Number 1.