Live at Scale: Zeros Matter

Several years ago, success for a live event was measured in tens of thousands or hundreds of thousands of concurrent viewers. A million plus – the scale of a Super Bowl – was reserved for major broadcasters and once-a-year events. A couple of weeks ago, Hotstar supported over 10 million concurrent viewers during the Vivo IPL final, eclipsing the eight million concurrent viewers who watched Felix Baumgartner fall from the stratosphere.

Companies are allocating massive budgets to create or license the most compelling content for their audiences, with the aim of making that content accessible on the platforms users prefer, under reasonable commercial terms. With all that effort and money going into content acquisition, and with the appetite for live content growing, the desire to reach seven figures is commonplace, regardless of the size of the current audience.

Today, the ability to deliver to one million concurrent viewers is neither a business impracticality nor a technical one; it’s an expectation. And with scale at this level, zeros matter.

Average. Median. 95th percentile. You can throw those numbers out the window. The edge case is now the priority, as the impact on viewers and the loss of revenue become material.

Achieving Redundancy

As with Alfred Borden in The Prestige, sometimes two is better than one.

With the desire to deliver a great experience to every viewer, redundancy is a common topic. The goal for any company is to ensure that the live viewing experience is on par with – or better than – broadcast. It’s a high bar, but it’s achievable. But as we know, the internet is built on the assumption of inefficiency and failure: bits get dropped, misdirected, and corrupted. So how do we think about redundancy in a world that is inherently prone to error? There’s no easy black-and-white answer; redundancy is a spectrum of gray.

To truly be redundant in a cloud-based architecture, you need to think holistically from glass to glass, from capture to consumption. This means:

  • Multiple cameras
  • Multiple “first mile” routes to upload content
  • Multiple encoding systems
  • If you’re monetizing with ads: multiple ad providers
  • If you’re monetizing via transaction: multiple commerce gateways
  • Multiple content origins
  • Multiple CDN providers
  • Manifest-level failover (e.g., alternate media in HLS, secondary BaseURL in DASH) supported by compatible players – see the sketch below
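
As a minimal sketch of manifest-level failover in HLS: a primary playlist can list a backup for each variant, and per the HLS specification a compatible player that fails on one variant can fall back to another with the same bandwidth. The CDN hostnames and paths here are hypothetical, not from any real deployment.

```
#EXTM3U
#EXT-X-VERSION:3

# Primary 720p variant on one CDN (hostnames are illustrative)
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
https://cdn-a.example.com/event/720p/index.m3u8

# Backup variant with identical characteristics on a second CDN;
# players treat a same-bandwidth duplicate as a failover candidate
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
https://cdn-b.example.com/event/720p/index.m3u8
```

The DASH analog is listing multiple BaseURL elements in the MPD, which a compatible player can walk through when segment requests fail.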

Not every company has the budget or the desire to be truly 100% redundant (content, systems, and people). While companies may want to eliminate risk, they need to make decisions based on economic, operational, and practical considerations. Is 100% redundancy even achievable? Well, for those of us who watched the power go out during Super Bowl XLVII, it was a reminder that we can’t control all the upstream dependencies.

Redefining Quality of Service in the Context of Live

Since redundancy is a matter of degree – not certainty – it changes how we should look at live. In the past, the common view was to approach live delivery with the nomenclature of a conventional NOC – network operations center – watching the bits go by, measuring latency, routing, and packets. As live streaming becomes inherently more complex, we need to expand the scope of what we measure and manage. We’re now in a world that needs a BOC – business operations center – to provide actionable intelligence across all the components of the live workflow, including the dependencies on third parties, from ad providers to CDN providers to client-side player behavior. And then there are social networks. It’s no surprise that the loudest voice is often the one complaining about an issue. Many of our customers hear about issues on social media more quickly than through their own programmatic notifications.

As a result, measuring quality of service is paramount, and the metrics for quality of service change in both depth and breadth. Even in today’s terms, quality of service is typically defined from the perspective of the client experience – i.e., the video player – the impact of delivery from origin through the last mile, and whether those factors result in rebuffering. With many vendors addressing only this narrow definition of quality of service, the actionable result is often simply to “switch CDNs.”

Quality of service for live has much greater scope than predicting “last mile” performance:

  • Monitoring the contribution feed and “first mile” delivery to measure latency or dropped frames that could affect the quality of the content or create timing drift.
  • Monitoring the transcoding process to ensure content is encoded with predictable, consistent throughput, since compute load varies with input and output settings, e.g., bitrate of the input stream, SD vs. HD renditions, video or audio processing (watermarks, audio channel mapping, caption and subtitle ingestion and/or transformation), 30 vs. 60 fps (or normalization to a specific frame rate), codec (H.264 vs. HEVC), and content protection (encryption or DRM). A minimal drift check appears after this list.
  • Monitoring origin throughput: content writes and CDN reads.
  • Monitoring subnet and POP performance of CDN delivery for both hot and cold hits – an area often overlooked due to the vast amount of data. It’s not uncommon for an errant group of edge servers – or an entire region – to exhibit degraded performance, but this is often apparent only when inspected through the lens of the 99/99.5/99.9 percentiles (see the percentile sketch after this list).
  • Measuring client-side performance, which remains important for understanding whether unexpected buffering or errors stem from device limitations, CDN degradation, delayed manifest refreshes, or failures in DRM license acquisition.
  • If the content sits behind any type of authentication or authorization workflow – e.g., TV Everywhere – those third parties should be monitored, especially for appointment-viewing experiences where there can be an influx of viewers. This applies to transactional use cases as well – e.g., PPV sports – where access to content may require handling a large volume of payment processing. And with ad-supported models utilizing server-side ad insertion, both ad servers and the third parties receiving ad impressions should be closely monitored.
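
As a sketch of the transcoding check above: if the encoder exposes a running output frame count (e.g., parsed from ffmpeg’s progress output in a real pipeline), comparing it against wall-clock time shows whether the encode is keeping pace with real time. The function below is a minimal, hypothetical drift calculation, not any vendor’s API:

```python
def encode_drift_seconds(frames_encoded: int, elapsed_seconds: float,
                         target_fps: float = 30.0) -> float:
    """Seconds the encode lags behind real time; positive means falling behind."""
    expected_frames = elapsed_seconds * target_fps
    return (expected_frames - frames_encoded) / target_fps

# After 60s of wall clock, a 30 fps live encode should have produced ~1800
# frames; only 1650 frames means the encoder is 5 seconds behind real time.
assert abs(encode_drift_seconds(1650, 60.0) - 5.0) < 1e-9
```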
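
And for the percentile point above, a minimal sketch of per-POP tail analysis. The log fields and values are assumptions for illustration, not any CDN vendor’s schema; the point is that p99-and-above percentiles surface a degraded POP that averages and medians would hide:

```python
import math
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Each record: (pop_id, delivery_latency_ms) – hypothetical log fields.
records = [
    ("lax-1", 42.0), ("lax-1", 55.0), ("lax-1", 61.0), ("lax-1", 950.0),
    ("fra-2", 38.0), ("fra-2", 41.0), ("fra-2", 44.0), ("fra-2", 47.0),
]

by_pop = defaultdict(list)
for pop, latency_ms in records:
    by_pop[pop].append(latency_ms)

for pop, samples in by_pop.items():
    print(f"{pop}: " + "  ".join(
        f"p{p}={percentile(samples, p):.0f}ms" for p in (99, 99.5, 99.9)))
```

Here lax-1’s p99 jumps to 950 ms while fra-2 stays below 50 ms – exactly the kind of errant-POP signal that mean and median latency would smooth away.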

Just as CDNs have varying capacity and performance from region to region, many of these dependent third parties run on public cloud infrastructure that shares the same regional limitations. Only when measurement accounts for every internal component and external dependency is the actionable data available to ensure a better-than-broadcast experience for every viewer.