This video crashed the player after 19 seconds

A lesson in rolling your own crypto

In any complex software, bugs exist, and when working on a large project you encounter them on a regular basis. The more memorable and interesting ones are the ones that have a simple cause, presenting in unusual and sometimes bizzare ways. This bug started with a single video causing a player to crash. This is by itself not that rare a report, however, the following details definitely stood out:

  • It only happened with a single specific video
  • It only happened on Android (but only on certain versions of Android)
  • It always crashed the player exactly 19 seconds into playback

The video itself was being served to Android devices using HTTP Live Streaming, Apple's standard for fragmented video delivery based on the m3u playlist format. For content protection, AES-128 with PKCS#7 padding was used. This is generally referred to as HLS Encryption (or "HLSe" for short). Other formats available for the content, such as MPEG-DASH, didn't present the issue, neither did HLS with FairPlay DRM (or even no drm). The issue was even more specific than "this video", it was "this video, with HLSe"

Luckily for testing purposes this bug was easy to reproduce in Google's ExoPlayer demo app, and with a respectable selection of test devices on-hand, it didn't take long to get error codes and stack traces out of logcat for various major Android versions. The specific errors varied depending on the version of Android on the device, but they all shared a common theme; notably that there was some issue decrypting the media in one of the video segments. The primary exception, and the most interesting "Caused by" exception further down the stack trace, were all variants of these two (differing in wording among the Android versions tested):

java.io.IOException: Error while finalizing cipher
        at javax.crypto.CipherInputStream.fillBuffer(CipherInputStream.java:104)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:155)
        at com.google.android.exoplayer2.source.hls.Aes128DataSource.read(Aes128DataSource.java:96)
Caused by: javax.crypto.BadPaddingException: error:1e06b065:Cipher functions:EVP_DecryptFinal_ex:BAD_DECRYPT
        at com.android.org.conscrypt.NativeCrypto.EVP_CipherFinal_ex(Native Method)
        at com.android.org.conscrypt.OpenSSLCipher$EVP_CIPHER.doFinalInternal(OpenSSLCipher.java:568)
        at com.android.org.conscrypt.OpenSSLCipher.engineDoFinal(OpenSSLCipher.java:385)
        at javax.crypto.Cipher.doFinal(Cipher.java:1476)

Diving deeper here, code-wise, is a bit of a rabbit hole. The core cryptographic code is written in C and is pretty intimidating if you're not used to it. Luckily, that "caused by" has a nice readable type, notably javax.crypto.BadPaddingException, so that helps us narrow down the issue further; the problem is with the padding. As per the specification from Apple, the padding algorithm being used in HLSe is PKCS#7, which is described in RFC-5652 Section 6.3:

6.3.  Content-encryption Process

   The content-encryption key for the desired content-encryption
   algorithm is randomly generated.  The data to be protected is padded
   as described below, then the padded data is encrypted using the
   content-encryption key.  The encryption operation maps an arbitrary
   string of octets (the data) to another string of octets (the
   ciphertext) under control of a content-encryption key.  The encrypted
   data is included in the EnvelopedData encryptedContentInfo
   encryptedContent OCTET STRING.

   Some content-encryption algorithms assume the input length is a
   multiple of k octets, where k is greater than one.  For such
   algorithms, the input shall be padded at the trailing end with
   k-(lth mod k) octets all having value k-(lth mod k), where lth is
   the length of the input.  In other words, the input is padded at
   the trailing end with one of the following strings:

                     01 -- if lth mod k = k-1
                  02 02 -- if lth mod k = k-2
                      .
                      .
                      .
            k k ... k k -- if lth mod k = 0

   The padding can be removed unambiguously since all input is padded,
   including input values that are already a multiple of the block size,
   and no padding string is a suffix of another.  This padding method is
   well defined if and only if k is less than 256.

If you're anything like me, that description is a little hard to picture in your head. So here's a few examples (all padding to a 16 byte block size):

  • 88 7c 46 66 9a 2f a2 59 4d 1e would be padded with 06 06 06 06 06 06
  • 63 would be padded with 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f
  • f5 5c 6b 13 cf 54 f8 45 c1 ca 67 ec 50 20 12 would be padded with 01

Simple enough then; in plain English, the value of the padded bytes is the number of bytes being padded, up to the block size being used.

Armed with an understanding of how the padding should work, and an error indicating incorrect padding, the next step was to check the code handling the padding for the video. The service in question is written in Go, which doesn't have an implementation of PKCS#7 in its standard library, so the service has it's own implementation. Here it is:

PKCS7Pad(in []byte) []byte {
    padding := 16 - (len(in) % 16)
    for i := 0; i < padding; i++ {
        in = append(in, byte(padding))
    }
    return in
}

Short and simple... and no obvious mistakes. There are even unit tests which seem to assert exactly the behaviour we're expecting. So perhaps we need some more clues. It was at this point a detour was taken to compare what our code was producing to other implementations. Luckily the AES-128 encryption applied to HLSe segments is applied to the entire segment file, rather than just to parts of the file contents, as can happen with other content protection schemes. Producing alternative encrypted segments was easily done with OpenSSL, and we could use our dynamic media packager to get a non-protected copy of segments to encrypt. With a segment downloaded in both clear, and AES-128 encrypted, an alternative encrypted segment was generated using the following command:

openssl aes-128-cbc -K $KEY -iv $IV -in clear.ts -out alt_protected.ts

Doing this revealed a new and interesting piece of information, the output of the encryption from OpenSSL was exactly 16 bytes larger than the output from our Go app (ie. one block). Perhaps the issue wasn't in the padding, but some strange edge case where a final block wasn't being added. Reviewing the Go code again with this new piece of information in mind, the following conditional started to look suspicious:

// If this block does not match the AES block size we need to pad it out
if bytesRead != aes.BlockSize {

At face value, the algorithm simply reads data from the input buffer in chunks of length equal to the block size. It then checks how many blocks were read and pads to 16 bytes if necessary. Finally it passes the 16 byte buffer into the cipher and streams the output to an output buffer. So realistically, we only expect this conditional to trigger on the final block, because that's the only one that will need padding. After some back and forth, something started to become obvious; there's a scenario that isn't explicitly being handled.... what if the final block is equal in length to the block size? The Go code will simply do nothing, asserting that no padding is required. To be sure, let's check the spec again:

the input shall be padded at the trailing end with k-(lth mod k) octets all having value k-(lth mod k), where lth is the length of the input.

So if k is the block size, and lth is the length of the intput, then a block of length 16 would need 16-(0 mod 16) bytes of padding. 16-(0 mod 16) equals 16... So actually this is supposed to have an entire block of padding at the end, with the value of each padded byte set to 16! Finally, a lead! Our Go code isn't spec compliant!

Skipping forward somewhat through testing a patch against the same content presenting the issue, we finally get working playback past the 19 second mark in the video! Ultimately the fix was to rewrite the flawed conditional to instead pad the last block regardless of how many bytes were read. This resolved the issue once and for all!

In summary, all the odd characteristics of this bug report stemmed from the the unlikely reality that out of all the content we've tested with our system, this is the first time that a video (stored as a fragmented mp4, muxed into transport stream segments) has resulted in a segment having a length (in bytes) that was perfectly divisible by 16. Why 19 seconds into the video? Well the segments are 10 seconds long and it was the third segment with the incorrect padding. Why did the crash only occur on some versions of android, and seemingly no other devices? Well it seems like the majority of implementations are resilient to this mistake. The older versions of Android were susceptible, but more modern versions could handle it.

With all questions answered, and the bug fixed, there's nothing left to cover except a few lessons learned:

  • Use existing, battle-tested implementations of standardised algorithms whenever possible.
  • If you do end up implementing one yourself, you need to be pedantic with the spec. Missing an edge case will cost you!
  • Even the really fun and weird bugs can have boring solutions.