Beyond Software: Hardware Processing for Streaming at Scale
I'm going to start off with a realization that hit me while writing this article: Every codec that I've ever used over my 30-year media career started as a hardware-based codec. And even though these codecs worked, almost every one of them eventually went all "soft." And, as a result of moving to software on general-purpose computing platforms, each of those "soft" codecs is much less efficient today.
With the exception of the first few months of my career (when I programmed racks of multimedia slide projectors to a cue track on a reel-to-reel tape, before convincing my boss to get a Pro Tools 442 audio system), the codecs in each media device I've used started as specialized hardware. Here's a quick list that may be either foreign or nostalgic, depending on how long you've been in the industry:
- My first broadcast field camera (ENG for you news folks) was BetaSX, which used an MPEG-2 hardware encoder.
- The first small form-factor prosumer camera I used, the Sony mini DV-based VX1000, was based on an MPEG-2 encoder to fit content on a small cartridge.
- My first HD camera, using the same mini DV tape, was a Canon XL2 that utilized—in a really odd way—an MPEG-2 encoder to encode "almost 1080p" and then a hardware decoder to extrapolate "almost 1080p" to actual 1080p for playback.
- The first videoconferencing system I used, the Rembrandt II/VP, had an MPEG-2 encoder and encryption module. (It was the MilSpec/Department of Defense version, but I assume even the enterprise version had the same.)
- All of the videoconferencing systems I consulted on, from Polycom to PictureTel to Lifesize, used hardware encoders, including H.261, H.263, and H.264 codecs, long before there was even a thing called streaming. In fact, hardware encoders were so embedded (no pun intended) into videoconferencing that my consulting recommendation to move toward soft clients (desktop apps to allow participants to join a videoconference from their desk) received a response somewhat equivalent to the quote often attributed to Bill Gates about not needing more than 640KB of memory. In other words, hardware was the present and future, until Apple introduced Mac-software-based QuickTime Videoconferencing and Microsoft followed suit with Windows-based NetMeeting.
- The first Internet Protocol Television (IPTV) encoders I worked with were all MPEG-2 hardware running on ASICs (more on those in a bit) for hundreds of channels that could be tuned by quality, bandwidth constraints, or even targeted geographies without requiring a reboot or software reload.
- AV integration solutions, which required zero latency, all relied on hardware for encoding, trans-rating, and scaling in order to stay within one field of video (that's roughly 16.7 milliseconds at 60 fields per second, in case you're keeping score). Even to this day, when there's a need to design a video matrix or synchronize multiple monitors in a given room, a combination of a master clock and hardware decoders is used to maintain consistency for devices that need to run flawlessly for weeks, months, or even years.
- Early H.264 adoption in streaming, once MPEG-1 and MPEG-2 encodes moved toward software but were still unable to generate compression efficiencies at scale, required specialized hardware.
This is by no means an exhaustive list, as there are numerous silicon-based H.265 encoders in the field in rugged, fanless cases that use just a trickle of power to do their encoding task. We've covered a number of these in past articles, spanning market verticals such as oil and gas exploration, remote military operations, manufacturing, and the automotive sector. But this article is geared toward understanding the benefits of purpose-built hardware encoders and decoders as they relate to real-world streaming and video transport scenarios.
It's also exactly why Dom Robinson and I explored the "greening of streaming" in an article a year ago, in which we recommended the industry add a third P (power) to the two other P's we've gauged streaming solutions by: price and performance.
Whether you use the cloud or a hybrid solution to encode your live or on-demand streaming content, high-efficiency encoding has come of age. This article explores the math and science behind these at-scale solutions that are not only good for business, but also good for the environment.
Why is software so popular? The short answer is its flexibility in programming video-specific workflows in the field, using field-programmable compute engines.
Around the end of the first decade of streaming, the H.264 codec had been optimized enough to work with general-purpose computing chips (what we'd call CPUs) and was soon optimized to take advantage of graphics processor chips (GPUs) to output more content from a single generic server. But unlike MPEG-1 and MPEG-2, in which numerous high-density hardware solutions arose that fit the five-nines model for telecom operations (99.999% uptime) and allowed for early IPTV delivery, the move to streaming with H.264 went in the opposite direction: more software-based workflows.
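To make the five-nines figure concrete, here's a rough Python sketch that converts an uptime target into the downtime it allows per year. The function name and the use of an average 365.25-day year are my own assumptions for illustration.

```python
# Convert an availability target into the downtime it allows per year.
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999%, the budget works out to a little over 5 minutes per year, which is why that target tends to favor hardware that can run untouched for long stretches.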
It's a fairly typical pattern. In the first few years of a new codec's lifecycle, dedicated hardware is required to keep encoding times within a reasonable range. But as codecs mature and feature sets solidify, most encoding moves to a purely software solution that runs on general-purpose CPUs; if more raw processing is needed, workloads can be offloaded to a high-end GPU sitting in the same server.
Yet these software solutions aren't all that efficient, certainly not at scale, consuming more processing power—and actual power—than should be needed.
And that's where high-efficiency processors that go by strange acronyms like ASIC, DSP, or FPGA come into play: At scale, these processors produce more high-quality encodes, faster and at much lower power consumption, than standard CPUs and GPUs.
Why Hardware Instead of a 'Generic' Server-Based Cloud?
I'll explore the acronyms later, but if you're still with me on this look at dedicated hardware being used to compress at highly efficient power rates, I'm sure one question is on your mind: Why does any of this matter if we all have access to significant cloud resources?
One reason is that the cloud isn't all that efficient. Yes, it's accessible 24/7 across the globe, which is convenient and means that instances can be spun up rapidly and torn down almost as quickly. This is perfect for services that don't require dedicated hardware. It's less perfect in that the generic computing hardware on which these instances run needs to be kept in an always-on operational state, which consumes significant power even when that hardware isn't being used to full capacity.
And then there's the question of exactly what full capacity entails, as most systems architects will consider generic servers to be at capacity when either the processor load or the network interface card (NIC) is above 60% utilization. That's understandable, because systems architects have to accommodate potential overhead and want to avoid peak capacities choking a generic server. But it's not efficient.
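Here's a minimal sketch of what that 60% ceiling means for fleet size. The stream counts are hypothetical numbers I picked for illustration, not measurements, and the function name is mine.

```python
import math


def servers_needed(streams: int,
                   streams_per_server_at_full_load: int,
                   utilization_cap: float = 0.60) -> int:
    """Servers required when each one is considered 'full' at the given cap."""
    usable_streams = streams_per_server_at_full_load * utilization_cap
    return math.ceil(streams / usable_streams)


# Hypothetical workload: 100 live streams, 10 streams per server at 100% load.
print(servers_needed(100, 10))                       # capped at 60% utilization
print(servers_needed(100, 10, utilization_cap=1.0))  # if servers could run flat out
```

With these made-up numbers, the 60% rule means provisioning 17 servers to do work that 10 fully loaded servers could theoretically handle, and all 17 draw power around the clock.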
Another issue with a generic server-based cloud is that the cloud itself adds inherent delay. On-prem encoding can be used for both on-prem and internet distribution, as we'll see later, with the added benefit of synchronization among encoders, decoders, and monitoring equipment.
Figure 1. Software-Defined Video over Ethernet’s (SDVoE) zero-frame latency encoders can downscale the incoming video image, allowing multiple encoders’ video images to appear on a single screen. Known as multiview compositing, this many-to-one display scenario takes advantage of Ethernet transport to eliminate the need for an expensive matrix switch. (Image courtesy of the SDVoE Alliance)
A Practical Use Case for Hardware
Remember at the beginning of the article when I mentioned AV integration solutions? These are the kinds of solutions that are installed in corporate boardrooms, training centers, or enterprise auditoriums. They're also used in entertainment venues, where multiple cameras and screens need to be synchronized and operating conditions are less than ideal for both local display and live streams.
You choose these hardware solutions when the video absolutely has to stay in sync among dozens of monitors that can be seen from one vantage point in a venue. (I'm looking at you, regional ice rink wiener-dog-race participants that I got to watch a few months back between hockey game periods.)
You've probably experienced the disconcerting effects when a computer-based video mixer (e.g., TriCaster) or even a software package (e.g., OBS or Wirecast) has far too many frames of video delay to use as both a video mixer and a source for image magnification (IMAG) within a given venue. The speaker at the front of the room raises her arm to emphasize a point, and a second or so later, the IMAG monitor in the room, be it a flat panel or a Jumbotron, shows her raising her arm.
How do the AV integration solutions get around this unnerving visual delay? Hardware encoding.
Sometimes, compression is added, but with many hardware solutions, the data path is wide enough—much wider than a general-purpose processor or even a GPU—that compression isn't necessary for the local display portion.
One of the best examples of this is a solution designed and licensed by the Software-Defined Video over Ethernet (SDVoE) Alliance for use in a variety of products offered by Black Box, IDK, Netgear, Semtech, ZeeVee, and others. I covered the SDVoE approach 2 years ago, but here's a recap.
The solution offers zero latency, passing through uncompressed 4K UHD up to 4:4:2 video and applying light compression to content up to the maximum level of HDMI 2.0 (meaning full 4K 4:4:4 60Hz support). "Our compression codec, when enabled, adds 5 lines of latency," says Justin Kennington, president of the SDVoE Alliance. "At UHD, 60Hz, that's 7.5 microseconds, which blows away even I-frame-only AVC/HEVC, etc."
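To put line-based latency in perspective, here's a rough back-of-the-envelope conversion in Python. It assumes 2,250 total lines per frame (2,160 active plus blanking), which is the common CTA-861 timing for UHD at 60Hz and not a number supplied by the SDVoE Alliance; the function and constant names are mine.

```python
def line_period_us(total_lines_per_frame: int, frames_per_second: float) -> float:
    """Duration of one video line, blanking included, in microseconds."""
    return 1_000_000 / (total_lines_per_frame * frames_per_second)


# Assumption: 2250 total lines per frame (2160 active plus vertical blanking).
TOTAL_LINES = 2250
FPS = 60

one_line = line_period_us(TOTAL_LINES, FPS)
five_lines = 5 * one_line
frame_period = 1_000_000 / FPS

print(f"One line:   {one_line:.2f} microseconds")
print(f"Five lines: {five_lines:.2f} microseconds")
print(f"One frame:  {frame_period:.0f} microseconds")
print(f"Five lines as a share of one frame: {five_lines / frame_period:.2%}")
```

Under that assumption, one line lasts about 7.4 microseconds and five lines about 37 microseconds. Either way, the added delay is a tiny fraction of a 16.7-millisecond frame, which is the point of the comparison with frame-based codecs.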
What Kennington didn't mention was that the solution does so while also carrying uncompressed 7.1 audio, adding AES-128 encryption, and supporting 12-bit color depth (this means it easily supports HDR10 and HDR10+, which use only 10-bit color depth). And it does all of this on a prepackaged hardware encoder/10Gbps network interface that's about half the size of a pack of playing cards. The encoder runs so efficiently that it needs zero airflow across the two chips (an FPGA and a 10G PHY), so it can be shoved into small enclosures.
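A quick calculation suggests why light compression enters the picture only at the top end. The sketch below counts active pixels only and ignores blanking, audio, and Ethernet overhead, so treat the results as rough estimates. The list of formats is my own illustration rather than an SDVoE specification.

```python
def uncompressed_gbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Raw video payload in gigabits per second (active pixels only)."""
    return width * height * fps * bits_per_pixel / 1e9


LINK_GBPS = 10.0  # nominal capacity of a 10Gbps Ethernet link

# Illustrative 8-bit formats: 4:4:4 carries 24 bits per pixel, 4:2:0 carries 12.
formats = {
    "UHD 30Hz 4:4:4": (3840, 2160, 30, 24),
    "UHD 60Hz 4:2:0": (3840, 2160, 60, 12),
    "UHD 60Hz 4:4:4": (3840, 2160, 60, 24),
}

for name, params in formats.items():
    rate = uncompressed_gbps(*params)
    verdict = "fits uncompressed" if rate < LINK_GBPS else "needs compression"
    print(f"{name}: {rate:.2f} Gbps, {verdict}")
```

By this estimate, UHD at 60Hz with full 4:4:4 color comes to just under 12Gbps, a bit more than a 10Gbps link can carry, while the other two formats need roughly 6Gbps.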
In addition, it can be combined with half a dozen other encoders in something that fits in a quarter of a rack unit. (Typical dimensions of these multi-encoders run about 6 inches deep, so they can be mounted in AV racks instead of requiring the larger and deeper data center rack that most generic servers need.)
In other words, the performance and power efficiency allow for smaller, cooler devices that can also be put into hibernation while awaiting the next task. Try doing that with your generic server based on decades-old designs.
Types of Dedicated Hardware Compute Engines
Now that I've talked about a practical use case, I'll spend the rest of the article discussing the different types of dedicated hardware.
First on the list is the ASIC (application-specific integrated circuit). An ASIC has the potential to be the most efficient and most powerful compute platform for streaming solutions, but it's also the most difficult to "get right," precisely because it's purpose-built for a specific set of tasks. As such, great care is taken before the silicon is "spun" or committed to pressing/fabrication because any oversight could render the ASIC unusable for its intended purpose.
A handful of companies in the industry have designed and deployed dedicated ASICs. They spin silicon every 2–3 years, but the field life of the products based on these ASICs can easily be a decade or two.
Is there a way to get similar compute efficiency while still allowing for programmability to accommodate changing needs or emerging industry standards? Yes, and that comes in the form of two different system-on-chip (SoC) approaches: DSPs and FPGAs.
Figure 2. NETINT claims significant power reduction, at scale, over CPU and GPU encoding. (Image courtesy of NETINT)
A DSP (digital signal processor) is an SoC that's often used to process baseband signals, such as audio or video elementary streams, that have been converted at some previous step in the workflow via an analog-to-digital converter (ADC). Think of the microphone on your smartphone, which feeds an analog signal into an ADC. This is then processed by a version of a DSP from Texas Instruments (TI) or Qualcomm to be recorded to a file, sent as part of a phone conversation, or both.
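For readers who haven't worked with signal processing, here's a toy Python sketch of that ADC-then-DSP chain. It quantizes a simulated analog tone and then smooths it with a two-tap averaging filter, which is one of the simplest operations a DSP performs. The function names, sample rate, and signal are all my own assumptions; a real DSP does this in dedicated silicon rather than in an interpreted loop.

```python
import math


def adc(sample: float, bits: int = 16) -> int:
    """Quantize an analog value in [-1.0, 1.0] to a signed integer code."""
    full_scale = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, sample))
    return round(clipped * full_scale)


def fir_filter(samples, taps):
    """Apply a finite impulse response filter: a weighted sum of recent samples."""
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += tap * samples[n - k]
        output.append(acc)
    return output


# Simulated microphone signal: a 1kHz tone plus noise that flips sign every sample.
SAMPLE_RATE = 48_000
analog = [0.5 * math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) + 0.1 * (-1) ** n
          for n in range(48)]

digital = [adc(value) for value in analog]   # the ADC step
smoothed = fir_filter(digital, [0.5, 0.5])   # the DSP step: two-tap average

print(digital[:4])
print([round(value) for value in smoothed[:4]])
```

The averaging filter largely cancels the sample-to-sample noise while leaving the tone mostly intact. Video codecs running on a DSP rely on the same kind of multiply-and-accumulate work, just at a vastly larger scale.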
There's a true science to working with DSPs, and the steep learning curve limited their use to very specialized audio or video products. About 2 decades ago, though, TI realized that it needed to lower the barrier to programming DSPs for video compression, so it began to provide both programming interfaces (via links to integrated development environments, or IDEs) and pre-licensed audio and video codecs under the DaVinci moniker.
DaVinci allowed the programmer to focus on programming in a language environment that he or she understood to tie into video workflow tools that the video architecture team also understood. When combined with a TI DSP, the result was a much more rapid time to market, since the DaVinci software interfaces and integrated codecs eliminated the need to learn how to optimize a codec to the DSP platform.
An example of one of these chips is the Digital Media System-on-Chip (DMSoC) TMS320DM368. Its video processing subsystem was capable of 1080p at 30 fps using an integrated H.264 video processing engine. Beyond the codec, though, this particular SoC has an integrated facial detection engine, an analog front end, a hardware on-screen display, and a number of digital-to-analog converters (DACs) to allow output to a local monitor. It also contains a variety of color depth options via a 4:2:2 (8-/16-bit) interface that allows up to 16-bit YCC and 24-bit RGB888.
Another company that makes DSPs that are targeted at the streaming video market is NETINT, which has offices in China and Canada. NETINT has a DSP-based SoC, called the NETINTG4, which it claims is much more efficient than a GPU or even a software-based CPU solution.
As Figure 2 shows, the SoC approach has the potential to not only scale much more easily, but also to do so with significantly lower power consumption requirements. More interestingly, NETINT claims that FFmpeg workflows that were designed for CPU or GPU processing can be easily ported to use the NETINTG4 SoC.
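In the simplest case, porting an FFmpeg workflow might look something like the Python sketch below, in which only the video encoder name changes. This is my own illustration, not NETINT's documented usage. The `-i`, `-c:v`, `-b:v`, and `-c:a` options and the `libx264` software encoder are standard FFmpeg, but `vendor_h264_hw` is a made-up placeholder, and a real hardware integration may well need additional device or filter options.

```python
import shlex


def build_transcode_command(source: str, output: str,
                            video_encoder: str = "libx264",
                            bitrate: str = "3M") -> list:
    """Build an FFmpeg argument list; only the encoder name varies per platform."""
    return [
        "ffmpeg",
        "-i", source,            # input file or stream
        "-c:v", video_encoder,   # video encoder to use
        "-b:v", bitrate,         # target video bitrate
        "-c:a", "copy",          # pass audio through untouched
        output,
    ]


software = build_transcode_command("input.mp4", "output.mp4")
# "vendor_h264_hw" is a hypothetical placeholder, not a real encoder name.
hardware = build_transcode_command("input.mp4", "output.mp4",
                                   video_encoder="vendor_h264_hw")

print(shlex.join(software))
print(shlex.join(hardware))
```

The sketch only builds and prints the commands, so it runs without FFmpeg installed; passing either list to `subprocess.run` would execute it.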
A quick note about these claims, though. While the not-for-profit that I run, the Help Me Stream Research Foundation, normally verifies manufacturers' claims on its test bench before recommending a specific video engine, we've not yet had a chance to do so with the NETINTG4. Still, since NETINT displays this data on the main page of its website, it's worth sharing with Streaming Media readers.
Speaking of claims, DSPs, and DaVinci, the other approach to hardware-based encoding comes in the form of the FPGA (field-programmable gate array). The solution I previously mentioned, from the SDVoE Alliance, uses an FPGA alongside the 10Gbps network interface, and the organization has explained to me in the past that it does so because of both power and performance benefits, since FPGAs provide very wide signal processing paths.
In the process of researching this article, I came across a name in the FPGA world that sounded familiar: Sean Gardner. I looked in my phone contacts and noted that there'd been a Sean Gardner at TI around the time I did some benchmark analysis of early DaVinci solutions. So I was curious if it was the same person. It was, and he's not only stayed in the video compute space, moving from DSPs to FPGAs, but he's also brought a similar make-it-easy-on-the-programmer approach to Xilinx for video solutions.
Figure 3. Xilinx provides two types of application programming interfaces as well as a resource manager to handle multiple FPGAs in a single server. (Image courtesy of Xilinx)
In a conversation I had with Gardner and his Xilinx colleague, Aaron Behman, the two of them joked that they've long thought that the G in FPGA should stand for "green" based on how power-efficient FPGA solutions are. As an example, they ran me through a presentation around the Alveo U30 Media Accelerator. Again, we've not verified these claims at Help Me Stream, but if the numbers are accurate, they're impressive.
The U30 is a PCIe single-slot card that's half-height and half-length, meaning it can fit in even small form-factor computers. The idea behind the U30 is to allow existing workflows in FFmpeg and other video tools to take advantage of an FPGA solution that offloads all video processing from the host CPU.
The card offers both hardware-integrated H.264 (AVC) and H.265 (HEVC) encoding and transcoding. At full power, it consumes 25 watts, so it's on par with the power consumption of most midrange CPUs on the market. For that 25 watts, though, the U30 is capable of supporting some impressive transcoding numbers: 2 UHD 60 fps (2160p60), 8 Full HD high frame rate (1080p60), or 16 Full HD standard frame rate (1080p30) real-time transcodes, including ABR scaling.
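Those three options look different, but a quick Python check shows they describe the same pixel throughput. The stream counts and resolutions come from the figures above; the function and variable names are mine.

```python
def pixels_per_second(width: int, height: int, fps: int, streams: int) -> int:
    """Total pixels processed per second across all concurrent streams."""
    return width * height * fps * streams


modes = {
    "2 x 2160p60": (3840, 2160, 60, 2),
    "8 x 1080p60": (1920, 1080, 60, 8),
    "16 x 1080p30": (1920, 1080, 30, 16),
}

rates = {name: pixels_per_second(*params) for name, params in modes.items()}

for name, rate in rates.items():
    print(f"{name}: {rate / 1e6:.1f} megapixels per second")
```

Each option works out to roughly 995 megapixels per second, which suggests the card has a fixed pixel budget that can be divided among resolutions and frame rates.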
Even more interesting is the fact that Xilinx claims to have tested up to eight of the U30 cards in a single server, with the ability to generate up to 256 720p30 transcodes in a single server, as well as adaptive bitrate outputs on a scale that most GPU-equipped servers can't manage.
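Taking Xilinx's numbers at face value, here's a rough sketch of what they imply per card and per stream. It counts only the 25 watts each card is said to draw at full power and ignores the host server, so the real per-stream figure would be higher. The variable names are mine.

```python
CARDS_PER_SERVER = 8        # as claimed by Xilinx
TRANSCODES_720P30 = 256     # as claimed by Xilinx
WATTS_PER_CARD = 25         # full-power draw per card, host server excluded

streams_per_card = TRANSCODES_720P30 // CARDS_PER_SERVER
card_power_watts = CARDS_PER_SERVER * WATTS_PER_CARD
watts_per_stream = card_power_watts / TRANSCODES_720P30

print(f"720p30 transcodes per card: {streams_per_card}")
print(f"Total card power: {card_power_watts} watts")
print(f"Card power per transcode: {watts_per_stream:.2f} watts")
```

That's 32 transcodes per card and less than a watt of card power per 720p30 transcode, again assuming the unverified claims hold up.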
Gardner notes that the U30 "is deterministic in performance" from a throughput standpoint, but that the true benefit comes from lowering the host CPU to manageable levels.
"From a CPU offload perspective," says Gardner, "this reduction in CPU loading offers lower CPU cost, better thermals, and better economics at a rack level."
In other words, by increasing performance and lowering overall power consumption, this FPGA solution provides a way to scale streaming without adversely impacting the environment.