Achieving Zero Latency for Video and Audio Is a Zero-Sum Game
As an industry, we offer plenty of justifications for why video isn’t delivered in a timely manner and at uncompressed quality.
Many of these are reasonable, centering on network capacity or intermittency, the cost to scale out low-latency solutions, or even the limitation of off-the-shelf processors to handle 4K Ultra HD or high dynamic range (HDR) content in real time.
But the problem runs deeper than any of those factors, going to the codecs themselves and to the packaging and segmentation that have sprung up around scalable streaming video, both of which add inherent latency. A few of us have been ranting about these latencies since the advent of HDS, HLS, and even DASH. The move toward OTT live streaming has brought these latencies (or synchronicities, as one industry colleague referred to the issue of latency at Streaming Media East 2019) to the forefront.
To better address latency for streaming, let’s use this article to explore ways to deliver video and audio that absolutely, positively have to be there now (to paraphrase the once-popular Federal Express slogan).
It’s not a theoretical exercise, as conversations at trade shows like InfoComm attest. There, corporations and houses of worship are looking to deliver content both locally (through the use of image magnification, or IMAG) with absolutely no latency and remotely (across campus or to distance-learning students). For reasons of both operational complexity and cost, these knowledgeable users don’t want to deploy two solutions: a zero-latency one for local delivery and a very-low-latency one for remote users who expect to interact with the presenter and his or her local audience.
Is the Codec Salvageable?
In the zero-latency local delivery use case, a standard segmentation-packaging streaming approach fails miserably, but the problem starts well before the packaging step, at streaming’s core: the encoders.
It’s not just the encoders’ problem, though, as many of them have been optimized over time for our industry-standard codecs. A major part of the problem lies with the codecs themselves and with their overall deficiencies for zero-latency encoding and delivery.
Discussions around live-streaming encoding and delivery often include a classic three-legged stool illustration, or what one of our interviewees for this article refers to as the “codec triangle” for decision making. The three “legs,” or triangle “sides,” must be in balance for a streaming solution to work. These three areas are speed, quality, and bandwidth. Some substitute the term “cost” for “bandwidth,” but both emphasize the fact that the higher the bandwidth, the higher the cost of consumption by consumers and corporations alike.
Streaming at scale is premised on the idea of saving bandwidth. As such, for on-demand content, the emphasis is placed on the intersection of speed and quality to preserve bandwidth. To eke out the best quality at the lowest bandwidth, video-on-demand encoders are allowed to spend more time than the length of the asset (e.g., 2 hours to encode a 1-hour video file) to create a final product that looks the best it can at a given bandwidth with a given codec.
The competing needs of quality, latency, and bandwidth are illustrated in this codec triangle. While HEVC lowers bandwidth, it does so at the cost of quality and latency, so most zero-frame latency solutions opt for higher-bandwidth intraframe (I-frame) options such as standards-based Motion JPEG or the purpose-built compression codec inside SDVoE. (Image courtesy of SDVoE Alliance.)
To achieve quality over limited bandwidth, the streaming industry makes heavy use of interframe compression, in which a group of pictures (GoP) is aggregated together and compressed across time, with only the differences between adjacent images in the GoP being encoded. These less-than-total-image frames are referred to as P or B frames; the initial frame in every GoP is called a keyframe or I-frame.
Almost all interframe compression solutions, including H.264 (AVC) and H.265 (HEVC), use an IPB approach, and the results are impressive when it comes to saving bandwidth. In many cases, using P and B frames, it’s possible to see upward of 70% aggregated bandwidth savings across a single GoP of 30–60 frames compared to an I-frame-only approach.
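To get a feel for where savings of that size come from, here is a rough back-of-the-envelope sketch in Python. The relative frame sizes (a P-frame at 40% and a B-frame at 20% of an I-frame) and the pattern of two B-frames per P-frame are illustrative assumptions, not measurements; real ratios vary widely with content and encoder settings.

```python
def gop_savings(gop_length, p_ratio=0.4, b_ratio=0.2, b_per_p=2):
    """Fraction of bits saved versus coding every frame as an I-frame.

    Sizes are expressed relative to one I-frame (= 1.0). The ratios and
    the B/P pattern are hypothetical defaults chosen for illustration.
    """
    remaining = gop_length - 1            # frames after the leading I-frame
    n_p = remaining // (b_per_p + 1)      # one P-frame closes each mini-group
    n_b = remaining - n_p                 # the rest are treated as B-frames
    ipb_size = 1.0 + n_p * p_ratio + n_b * b_ratio
    return 1.0 - ipb_size / gop_length


for length in (30, 60):
    print(f"GoP of {length} frames: {gop_savings(length):.1%} saved")
```

Under these assumed ratios, the savings land a little above 70% for both GoP lengths; change the ratios and the figure moves accordingly.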
Yet for live-streaming delivery, the use of P and B frames has the potential to cause significant disruption. Going back to the three-legged stool, the emphasis shifts to one of timely encoding and delivery. In a live-streaming scenario, speed is paramount, with quality and bandwidth being secondary.
In fact, to achieve true live encoding at zero latency (we’ll define this term a bit later), the timing window is incredibly short: Live content shot on cameras at 60 fps (e.g., 1080p60 or 4K60) requires a frame to be both compressed and delivered roughly every 0.0167 seconds, or about every 16.7 milliseconds (ms).
And that’s not even the whole story: While a frame must be displayed about every 16.7 ms, the transmission process takes time too, as does the packetizing process that moves the encoded video into Ethernet packets for delivery across an IP network. That means the encoding of a frame of video typically must take place in about half of that interval (i.e., in the 8-ms range) if video is going to be delivered at zero latency.
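The arithmetic behind that budget can be sketched in a few lines of Python. The 50% share reserved for encoding follows the rule of thumb described here; the function names and the 30 fps comparison are illustrative additions.

```python
def frame_interval_ms(fps):
    """Time between successive frames, in milliseconds."""
    return 1000.0 / fps


def encode_budget_ms(fps, encode_share=0.5):
    """Portion of the frame interval left for encoding.

    encode_share=0.5 assumes half the interval goes to encoding and the
    rest to packetizing and transmission (a rule of thumb, not a spec).
    """
    return frame_interval_ms(fps) * encode_share


for fps in (30, 60):
    print(f"{fps} fps: frame every {frame_interval_ms(fps):.2f} ms, "
          f"encode budget {encode_budget_ms(fps):.2f} ms")
```

At 60 fps, this works out to a frame interval of about 16.67 ms and an encode budget of about 8.33 ms.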
Which brings us back around to the Achilles’ heel of interframe streaming video: P and B frames. Since the encoder needs to compare multiple frames within the GoP to save bandwidth, the use of these P or B frames inherently adds latency.
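One piece of that added latency is easy to estimate. A B-frame references a frame that comes later in display order, so the encoder has to hold it until that later frame has been captured. The sketch below models only this waiting time; it is a simplified lower bound that ignores lookahead, rate-control buffering, and decoder-side reordering, and the function name is hypothetical.

```python
def b_frame_wait_ms(consecutive_b_frames, fps):
    """Minimum wait, in ms, before the first B-frame in a run can be encoded.

    With N consecutive B-frames, the reference frame that follows them is
    captured N frame intervals after the first B-frame.
    """
    frame_interval_ms = 1000.0 / fps
    return consecutive_b_frames * frame_interval_ms


for n in (0, 1, 2, 3):
    print(f"{n} B-frames at 60 fps: at least {b_frame_wait_ms(n, 60):.1f} ms of added delay")
```

Even two consecutive B-frames at 60 fps cost more than 33 ms under this model, which already exceeds a single frame interval.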
So what can be done to address the balance of speed, quality, and bandwidth (cost)? To think about what might be, let’s first examine a typical use case where zero latency might be needed.
In a live-venue setting, any latency is enough to cause visual discomfort. We’ve probably all experienced this at some point in settings where the presenter is right in front of the audience in person while also being projected onto a big screen in the same room.
If the presenter raises her hand, and the encoder requires even a dozen or more extra frames to encode, the result will be a one-Mississippi, two-Mississippi delay between her movement and what appears on the projection screen.
Worse still, if the presenter is using a computer that’s being projected onto a big screen, visual discomfort for the presenter can occur at around three frames of latency if she watches the big screen while using her computer mouse.
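To put those frame counts in wall-clock terms, a small conversion helps. The frame rates below are assumptions, since the examples here don’t state one, and the helper name is illustrative.

```python
def frames_to_ms(frames, fps):
    """Convert a delay measured in frames to milliseconds."""
    return frames * 1000.0 / fps


for fps in (30, 60):
    for frames in (3, 12):
        print(f"{frames} frames at {fps} fps = {frames_to_ms(frames, fps):.0f} ms")
```

At 60 fps, three frames is 50 ms and a dozen frames is 200 ms; at 30 fps, those double to 100 ms and 400 ms.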
So if it’s disconcerting to the local audience and to the local presenter, why would compression be used at all?
That is the argument made by the audiovisual (AV) industry over the past decade as it attempted to reach a point where technology advances allowed video signals to be sent at zero latency across IP. The need for zero latency is also the reason that almost all IMAG solutions installed in large lecture halls, sports arenas, and music venues are still primarily running on non-packetized, point-to-point solutions.
The AV industry and the streaming industry both use the term “latency” to describe delay. But where the streaming industry uses “low latency” or “ultra-low latency” to describe, respectively, up to 5 seconds of delay and up to 1 second of delay, the AV industry started off making a much bolder assertion: zero latency.
AV-over-IP solutions such as SDVoE allow multicast transmission of synchronized video data, which can be used in conjunction with hardware-based windowing and scaling units to create the effect of a single large video image across multiple like-kind HDTVs. Unlike traditional video wall scaling, an AV-over-IP solution does not require an expensive matrix switch in addition to the end-point scalers. (Image courtesy of SDVoE Alliance.)
In some ways, this “zero latency” reference was born of necessity, as multiple-input, multiple-output video switches (referred to as matrix switches, and somewhat akin to an old-school telephone switchboard) were able to route a matrix of inputs to one or more outputs, in configurations of up to 128 simultaneous outputs, at latencies of less than 1 ms.
Switching the Switches
The way these point-to-point solutions first worked in the 1990s was through the use of five-wire RGBHV cables that individually delivered three colors (red, green, blue) and two types of image synchronization (horizontal and vertical sync). The cabling was expensive (several dollars per foot), and the terminations were clumsy BNC connectors. The back of even a simple 16-input, 16-output (16x16) matrix switch would require 160 BNC connectors, and these units ranged up to 128x128 configurations (that were easily the size of a standard refrigerator) to accommodate more than 1,250 individual BNC connectors.
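The connector counts follow directly from the five-wire design: every input and every output needs one BNC connector per signal. A quick sketch of that arithmetic (the function and constant names are illustrative):

```python
SIGNALS_PER_PORT = 5  # red, green, blue, horizontal sync, vertical sync


def bnc_connectors(inputs, outputs):
    """Total BNC connectors on an RGBHV matrix switch of the given size."""
    return (inputs + outputs) * SIGNALS_PER_PORT


print(bnc_connectors(16, 16))    # 160
print(bnc_connectors(128, 128))  # 1280
```

A 128x128 unit therefore carries 1,280 connectors, consistent with the "more than 1,250" figure.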
The benefit of these RGBHV (and subsequent HDMI) matrix switches was that interlaced content could be replicated through the cable at absolutely no latency. In essence, a matrix switch was just a really expensive combination signal booster and distribution amp sitting in the middle of a long video cable that could be used to send the signal up to 100 feet with no signal degradation.
A brief side note here: The switch from RGBHV to HDMI cabling added a bit of a twist, as HDMI content was primarily in a progressive format (where the frame is presented as a single image) rather than interlaced (where the image is a series of alternating odd and even lines). While HDMI could support 1080i and 1080p, RGBHV cabling could only support 1080i. The move to progressive content (e.g., 720p, 1080p, 2160p) meant that the terminology needed to shift from zero latency to zero-frame latency. While some solutions still claim zero latency, any progressive content necessitates transmission of a full frame rather than a portion of a frame.
Once the signal needed to be moved beyond the lecture hall, though, even standard RGBHV or HDMI video cabling didn’t work (and in some cases, such as 100-plus-foot HDMI cables, didn’t exist), so a new solution was required. A few years ago, the form of delivery from an end point to the matrix transitioned from expensive, purpose-built video cabling to much less costly structured wiring. Typically, this was inexpensive, unshielded four-pair Cat5e or Cat6 cabling (unshielded twisted pair, or UTP) terminated to an RJ-45 or Ethernet connector and capable of delivering a baseband video signal up to 100 meters (m), or about 330 feet.
This switch to UTP inputs and outputs at the video matrix allowed AV integrators to use existing copper Cat5e and Cat6 wiring in buildings, even though the cabling was not delivering IP signals. Even copper Cat6 wiring, however, is limited to transmission distances of 100 m. Still, the use of UTP cabling opened up the possibility of gathering video from multiple classrooms at a centralized matrix switch. The basic premise remained the same: point-to-point inputs and outputs into a non-IP video matrix switch.
The move to UTP led to some intentional marketing confusion (names such as AV-over-Cat5 or HDBaseT) as IT professionals, seeing the cabling, might assume that it was standard IP-based video delivery. This confusion also led to a few years of unintentional mishaps, such as the fairly regular occurrence of an AV-over-Cat5e cable (with non-standard power pinouts, compared to traditional Power over Ethernet, or PoE, pinouts) being inadvertently plugged into, and ultimately frying, an IT-department Ethernet switch.
“HDBaseT is not a solution to address streaming demands,” says Paul Shu, president of Arista, a company that manufactures industrial computing solutions for healthcare, hospitality, and other mission-critical market verticals. “HDBaseT is intended to address the distance challenges that some pro AV applications encountered, a solution to extend the distance beyond what HDMI can reach.”
Justin Kennington, president of the Software-Defined Video over Ethernet (SDVoE) Alliance, explains just how exacting the expectations were for sub-frame delivery times that had been delivered by these RGBHV cables and, later, the structured wiring of Cat5e or Cat6: “We couldn’t move the industry away from the comfortable, familiar matrix switch until a technology existed that could truly duplicate its performance.” Says Kennington, “An HDBaseT matrix switch [delivers video] in dozens of microseconds, far below the threshold of human perception.”
SDVoE’s zero-frame latency encoders can downscale the incoming video image, allowing multiple encoders’ video images to appear on a single screen. Known as multiview compositing, this many-to-one display scenario takes advantage of Ethernet transport to eliminate the need for an expensive matrix switch. (Image courtesy of SDVoE Alliance.)
The AV industry is now attempting, for the third time in a decade, to replace the matrix switch with the Ethernet switch. According to Kennington, the financials will drive the move—he estimates the cost of a 48-port 10G Ethernet switch at approximately $5,000 versus a 48x48 video matrix switch at around $59,000—as long as the IP-based technology can meet the same zero-frame requirements of UTP or HDMI cabling.
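Kennington’s estimates are easier to compare on a per-port basis. The sketch below uses only the two figures quoted here; it covers the switch cost alone and leaves out endpoint hardware and installation, which those estimates do not address.

```python
def cost_per_port(total_cost, ports):
    """Average cost of a single port on a switch."""
    return total_cost / ports


ethernet = cost_per_port(5_000, 48)   # 48-port 10G Ethernet switch
matrix = cost_per_port(59_000, 48)    # 48x48 video matrix switch

print(f"Ethernet switch: ${ethernet:,.2f} per port")
print(f"Matrix switch:   ${matrix:,.2f} per port")
print(f"Ratio: {matrix / ethernet:.1f}x")
```

That comes to roughly $104 per port versus roughly $1,229 per port, a difference of nearly 12 times.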
FPGA to the Rescue
One of the solutions the AV industry homed in on, at least 3 years before the streaming industry started considering the benefits, was the use of a field-programmable gate array (FPGA) to provide massively parallel encoding. AptoVision, a company with expertise in packaging FPGA and Ethernet physical components (“phys” in networking and chip manufacturing lingo), developed the encoding technology that’s now known in the AV market as SDVoE.