Scale Matters: How to Deliver Five-Nines Streams to Global Live Audiences
Last August, in a Streaming Media Connect interview with Jarred Wilichinsky, Paramount’s SVP of ad operations, Streaming Media Ad Sense columnist Nadine Krefetz asked him what tends to break in live streams at scale. Wilichinsky quipped, “Everything. Everything breaks. With digital, nothing is 100%.”
Large-scale live streams have potential points of failure at every stage, from ingest to delivery, from first mile to last. A camera or encoder could fail at the point of origin. An unforeseen capacity bottleneck somewhere in a maxed-out network could degrade quality or cause a video to buffer or stall for viewers in any destination region around the globe. For premium events like a World Cup final or a Super Bowl halftime show, the impact of audio or video hiccups, delays, or gaps can be enormously costly.
The potential for small or catastrophic breakdowns at any given point means that those developing streaming architecture and solutions and delivering broadcast-quality streams to massive audiences must pay attention to numerous points of potential failure when striving for the high reliability that audiences, content owners, and sponsors demand.
Any streaming workflow is only as strong as its weakest (or least-tested) link. The more massive the stream, unfortunately, the larger the opportunity and the smaller the margin for error. So, what do the experts say about the architectural demands and challenges of maintaining five-nines uptime and broadcast quality when the stakes are too high to let either suffer? And what solutions do they recommend?
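To put "five-nines" in concrete terms, the arithmetic below is a minimal sketch of the downtime budget implied by 99.999% availability. The function name and the 3-hour event length are illustrative assumptions, not figures from any of the experts quoted here.

```python
# Downtime budget implied by an availability target.
# Assumption: availability is measured as uptime / total time over the period.

FIVE_NINES = 0.99999
SECONDS_PER_YEAR = 365 * 24 * 3600
THREE_HOUR_EVENT_SECONDS = 3 * 3600  # hypothetical event length


def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Return the maximum downtime (in seconds) allowed over the period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    yearly = downtime_budget_seconds(FIVE_NINES, SECONDS_PER_YEAR)
    event = downtime_budget_seconds(FIVE_NINES, THREE_HOUR_EVENT_SECONDS)
    print(f"Per year: {yearly / 60:.2f} minutes")
    print(f"Per 3-hour event: {event:.3f} seconds")
```

Under these assumptions, five nines allows roughly 5.26 minutes of downtime per year, or about a tenth of a second across a 3-hour event, which is why a single stall at the wrong moment can consume the entire budget.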
Redundancy, Redundancy, Redundancy
The key to maintaining successful streams at scale, according to Adam Miller, CEO of cloud-based asset management platform provider Nomad, is implementing as much redundancy as possible. “When you get to reliability, it’s a tricky topic because there’s about a million places a stream can break,” he says. “If I have a customer or an associate that comes in and says, ‘We are going to broadcast something live, and there needs to be a 100% chance it works,’ then you have to start with a little bit of science and draw a picture of all the touchpoints between the [ingest source and] the point of the person watching it. And you’ll find that there’s probably a hundred touchpoints in there that could break.”
The challenge, Miller says, is to determine how much to invest in redundancy and where to invest it for the greatest return. “We actually put a little chart above each of those touchpoints, and we say, ‘What is the cost to make this redundancy better? What is the effort level needed to make it better?’” he continues. “And, ultimately, we look at those and see a pretty clear path of where to put your money, where to put your budget and time and energy. And when you do that, you’re going to find that out of those 100 potential breakage points, probably about six really become most prominent. And that’s where you can put your energy to help increase the reliability. Rather than someone just saying, ‘Hey, I need multiple encoders,’ you put the energy toward where it’s going to matter the most. That’s what we do when we start looking at redundancy.”
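One way to picture the charting exercise Miller describes is as a simple ranking. The sketch below is an illustration, not Nomad's method: Miller mentions only cost and effort, so the impact and likelihood fields, the scoring formula, and all of the sample numbers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Touchpoint:
    name: str
    failure_impact: float      # assumed: share of viewers affected if it breaks (0..1)
    failure_likelihood: float  # assumed: chance of failure during the event (0..1)
    cost: float                # relative cost to make redundancy better
    effort: float              # relative effort needed to make it better


def priority(tp: Touchpoint) -> float:
    """Expected viewer impact avoided per unit of cost plus effort (hypothetical score)."""
    return (tp.failure_impact * tp.failure_likelihood) / (tp.cost + tp.effort)


def top_touchpoints(touchpoints: List[Touchpoint], n: int = 6) -> List[Touchpoint]:
    """Return the n touchpoints where redundancy spending matters most."""
    return sorted(touchpoints, key=priority, reverse=True)[:n]


# Hypothetical sample data; a real chart would cover every touchpoint in the workflow.
SAMPLE = [
    Touchpoint("primary encoder", 1.0, 0.05, cost=2, effort=1),
    Touchpoint("venue uplink", 1.0, 0.10, cost=3, effort=2),
    Touchpoint("origin server", 1.0, 0.02, cost=4, effort=3),
    Touchpoint("regional CDN edge", 0.1, 0.10, cost=5, effort=2),
]

if __name__ == "__main__":
    for tp in top_touchpoints(SAMPLE, n=2):
        print(f"{tp.name}: {priority(tp):.4f}")
```

With the sample numbers, the venue uplink and the primary encoder rank first and second, while a single regional edge ranks last because a failure there reaches far fewer viewers.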
“These are really complex systems we’re building here,” says Peter Wharton, chief strategy and cloud officer of QoS monitoring platform maker TAG Video Systems. “Just try to explain to somebody who’s not from our industry how complex a media workflow is to go all the way from live production through playout, through delivery, through OTT, and all these moving parts and variants and ladders. You need to monitor the whole thing, especially if you’re doing ad hoc building of systems or you’re setting up a live system on demand. Because then you have all these moving parts, and you’re just building something instantaneous. You have to make sure all those moving parts are working.”
Like Miller, Wharton insists that for the best return in terms of improving stream reliability, you have to make wise choices about where you place your emphasis. “Every point in the workflow has a different value. Obviously, the origination and the playout parts feed everything that comes afterwards. So, you really want to make sure that’s perfect. But as you go further down the food chain, every touchpoint that’s going to impact a viewer is impacting fewer and fewer viewers. So, therefore, you have to also make sure that you are monitoring in a way that you’re actually adjusting the cost of your monitoring to the value of the content at each point of the workflow. And that’s the challenge—to get all that right, but still make sure it works everywhere. Because you can’t spend the same amount of money for some edge of a CDN to affect one region that you do at the core.”
Corey Smith, senior director of advanced production technologies for CBS Sports at Paramount and former live operations director at Activision Blizzard, has spent years working in the whirlwind of massive esports events. He places particular emphasis on not just the last mile, but all the last miles of streams that may be crisscrossing the globe. Much of a global stream’s success, he says, depends on distributing traffic intelligently in a well-thought-out multi-CDN approach. Smith echoes Wharton’s point regarding apportioning your attention to the regions where you expect to experience the highest traffic and the value of “predictively knowing how your traffic is going to scale.” He also argues that maintaining reliability as your stream scales in one direction or another comes down to how fluidly you work with the multiple CDNs that are sharing the traffic load for your event.
“When I was at Xbox, we were doing a lot of things to scale large customer events, whether it’s the E3 keynote … or whatever that happened to be on the console that particular day,” says Smith. “We took a lot of pride in actually testing to [the point of] failure. But if you go to Akamai, Edgio, or some other CDN and say, ‘Hey, I’m gonna stress-test my network. Can you help support 2.5–3 terabytes per second of traffic because I want to scale to 2 and a half million concurrents?’ they’re gonna laugh you out the door and say, ‘We can’t absorb that on our network. Let’s try to build a config that makes sense.’ ”
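The scale Smith describes follows from simple multiplication of audience size by per-viewer bitrate. The sketch below works through that arithmetic; the 8 and 10 Mbps bitrates are assumptions chosen for illustration, and the calculation assumes every viewer pulls the same rendition at the same time, which overstates real traffic on an adaptive bitrate ladder.

```python
# Peak egress for a live event, in decimal units (1 Tbps = 1,000,000 Mbps).

def peak_egress_tbps(concurrent_viewers: int, bitrate_mbps: float) -> float:
    """Aggregate delivery rate in terabits per second."""
    return concurrent_viewers * bitrate_mbps / 1_000_000


def tbps_to_terabytes_per_second(tbps: float) -> float:
    """Convert terabits per second to terabytes per second (8 bits per byte)."""
    return tbps / 8


if __name__ == "__main__":
    viewers = 2_500_000
    for bitrate in (8.0, 10.0):  # assumed per-viewer bitrates in Mbps
        tbps = peak_egress_tbps(viewers, bitrate)
        tb_per_s = tbps_to_terabytes_per_second(tbps)
        print(f"{bitrate:.0f} Mbps x {viewers:,} viewers = "
              f"{tbps:.1f} Tbps ({tb_per_s:.3f} TB/s)")
```

At those assumed bitrates, 2.5 million concurrent viewers work out to 20 to 25 terabits per second, or 2.5 to 3.125 terabytes per second, which is in line with the range Smith cites.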
The key, Smith says, is being prepared to redirect traffic and to load-balance intelligently. “A lot of it is knowing where your traffic is going and being multi-CDN-agnostic or being agnostic to a single CDN provider, but also getting feedback and telemetry from your clients, so you can make actual, intelligent traffic decisions in near real time to what your customers are actually seeing,” he explains. “So, if you’re pushing a bitrate of 10 megabits plus, and you don’t own all of the edge ecosystem that you’re actually deploying to, you need to get that feedback so you can say, ‘This CDN provider in this particular region is not doing so well, and this other CDN provider is the one we need to start shedding traffic to.’ You have to build those kinds of telemetry systems into the actual base of the application because, like an encoder, you can’t just flip it off and flip it back on, on-the-fly, during a live event. You have to be able to ebb and flow your traffic across the global internet and do it seamlessly. So, there’s a real art to figuring out both the telemetry coming in from the outside and how the customers are experiencing the event. If the first couple minutes of the event are completely macroblocked and rebuffering, they’re going to tune out and go to your competitors.”
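The kind of client feedback loop Smith describes can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not a description of any vendor's system: the `Beacon` record, the rebuffer-ratio metric, the CDN and region names, and the inverse-ratio weighting are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Beacon:
    """One hypothetical player telemetry report."""
    cdn: str
    region: str
    watch_seconds: float
    rebuffer_seconds: float


def rebuffer_ratio_by_cdn(beacons: Iterable[Beacon], region: str) -> Dict[str, float]:
    """Aggregate client reports into a rebuffer ratio per CDN for one region."""
    watch: Dict[str, float] = defaultdict(float)
    stall: Dict[str, float] = defaultdict(float)
    for b in beacons:
        if b.region == region:
            watch[b.cdn] += b.watch_seconds
            stall[b.cdn] += b.rebuffer_seconds
    return {cdn: stall[cdn] / watch[cdn] for cdn in watch if watch[cdn] > 0}


def traffic_weights(ratios: Dict[str, float], floor: float = 0.001) -> Dict[str, float]:
    """Weight each CDN inversely to its rebuffer ratio; weights sum to 1."""
    inverse = {cdn: 1.0 / max(ratio, floor) for cdn, ratio in ratios.items()}
    total = sum(inverse.values())
    return {cdn: value / total for cdn, value in inverse.items()}


if __name__ == "__main__":
    reports = [
        Beacon("cdn_a", "sa-east", watch_seconds=1000, rebuffer_seconds=50),
        Beacon("cdn_b", "sa-east", watch_seconds=1000, rebuffer_seconds=10),
        Beacon("cdn_a", "eu-west", watch_seconds=1000, rebuffer_seconds=5),
    ]
    ratios = rebuffer_ratio_by_cdn(reports, "sa-east")
    print(traffic_weights(ratios))
```

In this made-up region, the CDN with a 5% rebuffer ratio ends up with about one-sixth of new sessions and the one at 1% gets the rest. A production system would also account for each CDN's available capacity and cost, which this sketch ignores.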
One essential part of load-balancing CDN traffic for global live streams is playing to your CDN partners’ strengths as you determine where to direct traffic by region for maximum reliability and uptime. According to Joshua Johnson, director of solutions architects at cloud CDN EdgeNext, it’s all about knowing where your audience is and knowing, CDN-wise, “who is good there. Which provider is a dominant player in that general area? Who has got the infrastructure? And are you actually relying upon them, or are you relying upon them and their partners? And do you actually know who their partners are?”
“A lot of those partners are only as good as their peering agreements,” Smith concurs. “At some point, it’s still a business conversation. Theoretically, on paper, you can get bits to anybody in the world.” What matters most for large-scale stream reliability, though, “is how you optimize those routes.”
And, of course, it’s important, back at the origin, not to overlook the obvious, Miller says. When show time is upon you, don’t fix it if it’s not broken. “When it comes close to time to deploy, don’t touch anything. People forget that and they think, ‘Oh, let’s just change out this encoder at the last minute.’ If you’re trying to reliably distribute something, touch as little as possible. And if you’re going to do it 10 times, don’t touch it at all. Build it once, and then just leave it and reuse it 10 times. I find that a lot of people forget that golden rule: Don’t go changing things 2 minutes before,” notes Miller.
Monitoring and Leveraging Real-Time Data
There’s no substitute for building robust streaming architecture from origin to playout to delivery, with key points of redundancy along the way. Having a sound multi-CDN strategy is also critical. But as valuable as the “don’t touch” dictum may be when it comes to tech, streaming at scale is anything but a hands-off endeavor. Making it work means relentless monitoring and data-gathering and making on-the-fly adjustments wherever needed to maintain delivery and quality at a high level to wherever you’re pushing the stream.
The need for gathering usable analytics without flooding the zone with more data than you can possibly absorb or analyze applies to measuring performance and maintaining reliability at all stages of the workflow, not just delivery and the last mile. But it’s all about getting data you can use and apply effectively in real time.
“You want to monitor all the points” in your workflow, says TAG’s Wharton, “but you don’t necessarily want to put a thousand monitoring points on screens and have operators staring at them and just get noise. So, you need some intelligence there too that actually can look at all those monitoring points and know that they’re all exactly the same as the origin, that you can look at one and know it’s the truth for everywhere in that workflow. And then when there’s an error, it will tell you where that error is so you can quickly discover it and do remediation because your root cause analysis is also done by the system. So, we’re looking for a level of intelligence in these systems. We’re looking at something that does monitor all these points and actually could go deep enough to look at content in those points and not just say, ‘There’s data flowing here,’” he explains. “A lot of the problems you find might be misrouted signals, where it’s the wrong thing airing in the wrong place because you’re building these workflows where you might be routing away for commercial breaks and not coming back, or you’re breaking away and you’re not getting the right returns. It’s complex, but the tools are out there, and you can really monitor the whole thing very cost-effectively today.”
Similarly, when it comes to CDN load-balancing, it takes judicious gathering and application of real-time data to make a multi-CDN strategy effective. Stef van der Ziel, CEO of cloud streaming platform provider Jet-Stream, describes his company’s monitoring-based approach when doing large-scale streams on a global level: “We have agents across the world who are constantly probing the performance of streams because that gives us global insights on the performance and availability of CDNs. It doesn’t have the fine granularity of user metrics, but we use this so customers can use any player out there without implementing any metrics, and at least we have some basic insights. And then, our customers themselves use services like Conviva or Mux to measure the performance. They can actually feed back that information in real time to our load-balancer, so they can learn their algorithm to switch over to another CDN if CDN performance falls below a certain level.”
But how do you analyze that data once you have it? With so many different sources of information coming in from different places at different times, Paramount’s Smith says, “You need an aggregator, like Datazoom, or some other kind of near-real-time analytics engine that’s kind of pulling those points of data together to give you an accurate picture of what’s going on. Conviva and Mux are great. I’ve used the LTN tool in the past for load-balancing and then switching between different CDNs, but that’s more of a manual API approach. There’s certainly more sophistication these days in auto-load-balancing your traffic. But again, it’s all about analyzing the sources and making a determination of where a customer should actually go.” If you’re simply shifting traffic indiscriminately to offload traffic, he explains, you never “know if you’re sending them to another CDN that could be having issues.”
The data you gather on your stream performance “will never be 100% accurate,” van der Ziel concedes. “You can also have false positives. You don’t want an algorithm to decide to switch over all the traffic to another CDN because you can just blast them away or make the performance even worse. So those are still challenges we’re looking at and how to solve those problems.”
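The two risks van der Ziel raises, false positives and overwhelming the fallback CDN, are commonly handled by damping the switch. The sketch below shows one such approach under assumed parameters; it is not Jet-Stream's algorithm, and the class name, threshold, streak length, and step size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DampedShifter:
    """Shift traffic away from a primary CDN gradually, and only on sustained trouble."""
    threshold: float = 0.02        # assumed rebuffer ratio that counts as a bad sample
    required_bad_samples: int = 3  # consecutive bad samples needed before acting
    max_step: float = 0.10         # largest share of traffic moved per sample
    primary_share: float = 1.0     # fraction of traffic currently on the primary CDN
    bad_streak: int = 0

    def observe(self, primary_rebuffer_ratio: float) -> float:
        """Record one measurement and return the updated primary share."""
        if primary_rebuffer_ratio > self.threshold:
            self.bad_streak += 1
        else:
            # A single healthy sample resets the streak, so isolated spikes are ignored.
            self.bad_streak = 0
        if self.bad_streak >= self.required_bad_samples:
            # Move a bounded slice of traffic rather than all of it at once.
            self.primary_share = max(0.0, self.primary_share - self.max_step)
        return self.primary_share


if __name__ == "__main__":
    shifter = DampedShifter()
    for sample in (0.05, 0.05, 0.05, 0.01, 0.05):
        print(f"ratio={sample:.2f} -> primary share {shifter.observe(sample):.2f}")
```

In the sample run, traffic moves only after the third consecutive bad reading, and then only by 10 percentage points; the healthy reading that follows resets the streak. The trade-off is slower reaction to a genuine outage, so the streak length and step size would need tuning per event.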
Building Reliable Partnerships
The impediments to delivering reliable live streams often become more manageable when you bring partners into the mix, rather than handling the entire workflow end to end, because it allows you to hand off certain key tasks. But it can also multiply your headaches when your partners aren’t as committed to reliability or as invested in your event as you are.
“Knowing where your traffic failure points are on your network during a live event also means monitoring the partners you’re investing in,” says Smith. “If they’re saying, ‘Don’t worry about this. Just send us your stream. We’ll go ahead and distribute it,’ that would be my first red flag. If you have a tier-one event that you’re trying to pull off, and it has an incredible marketing and PR value to you as an organization, but it appears that nobody cares on the other side, that is a problem.”
Specifically, this means gathering just as many datapoints on what your partners are doing—up to and including content delivery at the edge—as you do on the local touchpoints you’ve put in place. “When you start working with partners and other companies, you need to get those analytics back into your system and be able to watch them too,” says Wharton. “Sometimes you even put your own monitoring at the edge to provide yourself the confidence that they’re actually doing what you think they’re doing.” The name of the game, according to Wharton, is “trust and verify.”
One inevitable question, given the advance of cloud workflows since the pandemic and the preponderance of cloud vendors and options for moving key architecture elements to the cloud, is this: What are the pros and cons of migrating to the cloud when it comes to maintaining stream quality and reliability as you scale up, without driving up your costs inordinately along the way?
David Hassoun, chief technologist for cloud media solutions at Dolby.io, argues for a hybrid approach, while expressing some reservations about transitioning streaming at scale to the cloud at this point in time. “Especially in the world that we are in now, where not a lot of people are always in the office anymore, and productions happen everywhere, that’s a big factor in why the cloud can be really powerful,” says Hassoun.
But the capacity for remote control is not always an advantage. “The downside is when we talk about things like eyes on glass,” Hassoun continues. “The closer you are to metal, the more you can control it, especially if something goes wrong. Utilizing cloud services adds additional risk. But it’s also a necessary element. I believe hybrid is going to give us the consistent longevity that we need and flexibility along with the element of control. There are also big benefits and drawbacks with cost and management, but they all contribute toward making things work with the teams that we have, where they are, and in the situations that we’re going to be up against and need to adapt to. We’ve also seen situations where cloud elements come in for failure scenarios. If we have a problem here, we can automatically revert and direct traffic, as necessary, to wherever they need to go.”
Another key consideration when you’re looking at moving your live-streaming workflow (or elements of it) to the cloud, says Nomad’s Miller, is the nature of your content. “If I have 12 high-quality 24/7 channels, I’m probably not going to put that in the cloud directly as my primary encoder. I’d be sending a tremendous amount of content over the internet, just to encode it up there and reduce it.” In that scenario, he says, the more prudent approach is to encode locally and “take the final feeds and then send them out. That’s going to be a huge cost advantage. A lot of it comes down to what your content looks like and who its recipients are. Eventually, it’s going to land in the cloud. You can’t get away from that. But at what point in the process do you put it there, and for what advantage?”
According to Smith, questions like whether to keep your encoders on the ground are “all part of the challenge of how you build infrastructure that actually scales.” And the best way to meet that challenge can differ widely from one streaming scenario to another, although the potential to move it all to the cloud has advanced by leaps and bounds over the last few years.
“When I was at Activision Blizzard, we were cloud-first,” says Smith. “Everything was moving to cloud because that was the model going from 2019 to 2020. When the pandemic hit, we had to make a massive pivot to cloud master control. We had to take this idea of master control and do full-on cloud-based production with it, which turned out to be very messy at first. But the technology improved, and today, it’s light years beyond where it was in 2019. The ability to do a full production in the cloud, where you’re basically backhauling a studio feed plus whatever your feed is from your venue and mixing a show in cloud as a PCR solution, is here today. The ecosystem components exist today, but it’s up to us to stitch them together. Ninety percent is off-the-shelf parts, and 10 percent is the glue that puts it all together.”
Wharton contends that the need for that “glue” is precisely where many of those doing streams that are too big to fail might legitimately hesitate in today’s environment. “I haven’t seen the kind of orchestration that lets me build a whole workflow for live production in the cloud: do my rehearsal, take it through a test drive, and then shut it down and start it 2 days from now and have it run for the 3 hours of the production and not pay to leave it running for 2 days until the live event—or to sit there and be able to adjust it dynamically during an event,” he says. “I think when we get that kind of dynamic orchestration that we can trust for live production, then the economics and the business model will really fit, and it’ll take off.”
The Three R’s of Streaming at Scale: Reaching Remote Regions
One of the biggest moving targets in large-scale streaming is pushing streams region to region, particularly into international and intercontinental markets, and ensuring reliable service around the world, especially when you don’t know in advance where the audience for a specific event is going to come from. The geo-specific aspect of reaching the last mile presents a multitude of challenges, and as EdgeNext’s Johnson says, “It isn’t getting any easier.”
A lot of it involves leveraging the edge, as Johnson explains, but the last-mile reliability challenge boils down to controlling the infrastructure. When it comes to streaming to remote regions today, he says, “there’s a lot more focus on the ability to leverage ISPs and get your physical infrastructure into these remote ISPs. There’s also a large part of the world that is starting to adopt, surprisingly enough, more of a P2P type of architecture as well,” he continues. “But you don’t have control over that P2P architecture, and while it might bring down your costs, it’s not always providing the best performance in some of these remote regions of the world. There’s less regulation within the ISPs, so they’re doing things to attract business and bring traffic onto their ISP to increase their revenues without always providing an advantage to the end user. While the ISP may very well bring it in, their latency might not be as good as a different provider’s. And now they’re starting to redirect traffic and pull it in. So regulatory changes within certain parts of the world are necessary, especially from the telecommunications end of things.”
Johnson acknowledges that the regulatory picture is improving in some parts of the world where it’s been historically deficient and is moving closer to standardization. But he adds, “I don’t think it’s getting any easier, because the technologies also are changing. Everybody’s trying to adopt different technologies within their streaming, and you have to be able to provide support for all of that.”
“Capacity is really the killer,” says Smith. “I think streaming has benefited from the need for building capacity for other types of large file delivery. Video games have gone from 6 or 7 gigs fitting on a disc to blowing up to anywhere between 80 and 120 gigabytes for a single game download. And if you’ve got a AAA title that has some kind of content update, now all of a sudden, you’re pushing multiple terabytes per second of traffic. You could easily do that with a video-streaming event due to demand. However, the streaming world is benefiting from the content delivery capacity of the world having to build up all of this global performance over the years because of iPhone updates and video game companies pushing these large files.”
But for large-scale streaming events and the CDNs that deliver them, the manageable economies of scaling break down when that capacity goes unused or when it’s not filled consistently or predictably. Looking at the challenges of meeting the needs of large-scale streaming from the CDN’s perspective (and, in turn, how the CDN’s costs and challenges get passed on to those doing the streaming), Johnson says, “Capacity costs the CDN significantly, to have it available and to not actually fill it. Within the CDN world—a world I’ve been in for a long time—we try to keep it at a maximum of 60% capacity usage so you have enough headroom. But at the same time, if you have a blast event in a certain region, how do you actually make sure that you have additional capacity? How do you make sure that you can support it? It’s not just your bandwidth; it’s your server capacity. It is your DNS capacity and being able to handle the constant change, the constant requests, and the directing of traffic and things like that. It can be an infrastructure nightmare at times to do so. And a lot of these data centers understand that you need it. So, now, all of a sudden, their pricing starts going up and up and up. When you’re looking at investing in a new area and trying to make sure that the capacities are there and that you’ve got the performance, do you build it first and then sell to it, or do you sell and then build it?”
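Johnson's 60% ceiling translates directly into how much capacity has to be provisioned and paid for. The sketch below works through that arithmetic; the 6 Tbps regional peak is a made-up figure, and the function names are illustrative.

```python
# Provisioning arithmetic for a utilization ceiling (60% per Johnson's rule of thumb).

def required_capacity_tbps(expected_peak_tbps: float, max_utilization: float = 0.60) -> float:
    """Capacity needed so the expected peak stays at or below the utilization ceiling."""
    if not 0.0 < max_utilization <= 1.0:
        raise ValueError("max_utilization must be in (0, 1]")
    return expected_peak_tbps / max_utilization


def burst_headroom_tbps(provisioned_tbps: float, current_tbps: float) -> float:
    """Spare capacity left to absorb a sudden regional spike."""
    return max(0.0, provisioned_tbps - current_tbps)


if __name__ == "__main__":
    peak = 6.0  # hypothetical expected regional peak, in Tbps
    provisioned = required_capacity_tbps(peak)
    spare = burst_headroom_tbps(provisioned, peak)
    print(f"Provision {provisioned:.1f} Tbps to carry a {peak:.1f} Tbps peak at 60%")
    print(f"Headroom at peak: {spare:.1f} Tbps")
```

Under that assumption, carrying a 6 Tbps peak at 60% utilization means provisioning 10 Tbps and leaving 4 Tbps idle at the busiest moment, which is the unfilled capacity Johnson says costs the CDN. As he notes, bandwidth is only one dimension; server and DNS capacity would need the same treatment.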
Smith argues that the “build-to-sell” issue becomes moot when meeting potential demand becomes critical to a network’s survival. “A lot of these smaller-capacity networks are going to get destroyed on certain events anyway, because their customer base coming from their network is going to have to receive that traffic, whether it’s a large-file download delivery or a live-event stream of some kind. So, by sheer will to survive, they have to build the capacity out and hope that their transit and peering agreements get better, so they can actually offload some of those costs in other places.”
Build for Success, or Build for Failure?
Benjamin Franklin famously said, “Failing to prepare is preparing to fail.” But when it comes to streaming at scale, with its seemingly innumerable potential points of failure, preparing specifically for failure—or at least identifying where your stream might fail and determining exactly what you’ll do when it does—may be the surest way to prepare to succeed, as paradoxical as that may seem.
“There’s a statement that we’ve used often, which is, ‘Do you build the system for success, or do you build the system for failure?’” says Nomad’s Miller. “If it’s successful, you’re going to be in an amazing place, and if it fails—if you planned for failure, that’s great. So which side do you plan for?”
The key, Miller insists, is not just identifying where your stream might break down, but actually playing out as many scenarios as possible from failure through recovery. “I think the problem is that most people don’t follow through the scenarios that could actually happen,” he explains. “An encoder stops, or the internet stops, or a camera stops. They just assume that they’ll be able to figure out their backup plan on-the-fly. In reality, you have to try it. And most people are frantic before these events, so the last thing they want to do is unplug things and see if they really work. But you have to stress-test those systems. Turn on the monitoring, see what flags go off. What do you do then?” There’s only one way to know: “You need to go through that stuff ahead of time.”