Hardware-Based Transcoding Solutions Roundup: Testing Performance
Live transcoding in the cloud has quickly evolved into the optimal workflow for streaming live events. In this scheme, you send a single stream to the cloud, transcode it to produce a complete encoding ladder, package as necessary, and deliver the streams to the origin server or CDN.
This article analyzes several hardware-based transcoding solutions. For H.264, we compare the NVIDIA H.264 and Intel Quick Sync codecs, analyzing performance and output quality for live transcoding applications. For perspective, we also included FFmpeg’s x264 codec using both the medium (default) and veryfast settings. For HEVC, we evaluated Intel’s SVT (Scalable Video Technology)-HEVC, a software-based codec that purports to deliver hardware-like performance; NGCodec’s FPGA-based HEVC encoder; and x265 using the medium and veryfast presets.
In both cases, we measured performance using the encoding ladder shown in Table 1 with 1080p 60 fps source clips. That is, on each tested computer, we tested whether the codec could produce the entire ladder, and if so, how many simultaneous instances of the ladder it could produce. For software encoders, we allowed frame rates to drop to 55 fps, while for hardware-based encoders, we allowed no dropped frames.
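Table 1 itself isn't reproduced here, so the rungs below are illustrative placeholders rather than our actual ladder; what matters is the shape of the test: a single FFmpeg process fans the 1080p60 input out into every rung, and we launched as many copies of that process as the machine could sustain. A minimal sketch, which prints the command it would run so you can inspect it first:

```shell
# Sketch of a one-process encoding-ladder run. The height:bitrate rungs are
# placeholders, NOT Table 1's actual values. Pipe $CMD to sh to execute;
# launch N copies in parallel to approximate the density test.
INPUT=input.mp4
build_ladder_cmd() {
  printf 'ffmpeg -y -re -i %s' "$INPUT"
  for rung in 1080:5M 720:3M 480:1.5M 360:0.8M; do
    h=${rung%%:*}; br=${rung##*:}
    # one output per rung: scale, cap the bitrate, 2-second GOP at 60 fps
    printf ' -map 0:v -c:v libx264 -preset veryfast -vf scale=-2:%s -b:v %s -maxrate %s -bufsize %s -g 120 out_%sp.mp4' \
      "$h" "$br" "$br" "$br" "$h"
  done
  printf '\n'
}
CMD=$(build_ladder_cmd)
echo "$CMD"
```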
Table 1. The standard encoding ladder
In a perfect world, we could have tested all codecs on a single computer to arrive at a uniform cost-per-stream hour. However, the platforms used for hardware-assisted encoding were almost always suboptimal for software-only encoding, which frustrated these efforts. In addition, you'll get different measures of comparative performance based on machine type, and finding the optimal configuration for software and three hardware-based codecs could easily be the topic of a separate article. Long story short, we include pricing information for all test encodes, but you'll likely have to do a lot more work to identify the most economically effective instance types for your production transcodes.
After assessing performance, we measured quality via standard rate-distortion curves with BD-Rate analysis. We also measured the subjective quality of the 3Mbps streams produced in each category with web service Subjectify.us. The encoding parameters used for each set of tests are identified below.
We measured objective metrics—Video Multimethod Assessment Fusion (VMAF) and peak signal-to-noise ratio (PSNR)—with four 2-minute test clips. These included segments from Netflix’s Meridian and Harmonic’s Football test clips, plus the GTAV test clip (2x the 1-minute clip) and a 2-minute compilation of Netflix clips from Xiph.org, including DinnerScene, Narrator, SquareAndTimeLapse, and BarScene.
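Readers who want to reproduce the objective scores without the MSU tool can compute VMAF directly in FFmpeg, provided their build was configured with --enable-libvmaf. The filenames below are placeholders, and the command is assembled as a string purely so the sketch can be inspected before running:

```shell
# Sketch: score an encode against its source with FFmpeg's libvmaf filter.
# Requires an FFmpeg build with --enable-libvmaf; filenames are placeholders.
# The command is echoed rather than executed; pipe it to sh to run it.
DIST=distorted.mp4
REF=reference.mp4
VMAF_CMD="ffmpeg -i $DIST -i $REF -lavfi \"[0:v][1:v]libvmaf=log_path=vmaf.json:log_fmt=json\" -f null -"
echo "$VMAF_CMD"
```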
NGCodec suggested subjective testing later in the process after the first round of encodings and objective testing was completed. Ordinarily, you would perform objective and subjective tests using the same clips. However, Subjectify.us recommends test clips no more than 20 seconds long, so for these tests, we excerpted the first 20 seconds of each clip and tested those.
Many NVIDIA GPUs contain one or more hardware-based encoders and decoders that are separate from the CUDA cores, freeing both the graphics engine and the CPU for other tasks. We tested the NVIDIA H.264 encoder using a G3.4xlarge AWS workstation set up for us by engineers at Softvelum, who have significant experience with NVIDIA-based transcoding to support video producers using its Nimble Streamer cloud transcoder. AWS G3 instances include NVIDIA Tesla M60 GPUs that are used during the hardware encode. As with all AWS instances, pricing varies widely based on commitment level, with the Linux spot price at $1.14 per hour when we tested.
We derived the NVIDIA script from a white paper titled "Using FFmpeg With NVIDIA GPU HW Acceleration" (membership required), ultimately using the following script:
ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc -preset medium -b:v 5M -bufsize 5M -maxrate 5M -qmin 0 -g 120 -bf 2 -temporal-aq 1 -rc-lookahead 20 -i_qfactor 0.75 -b_qfactor 1.1 output.mp4
This differed from the NVIDIA recommendations in two meaningful ways: first, we substituted the medium preset for the recommended slow to improve performance; second, we limited the buffer to 1x the data rate to minimize bitrate variability. We also changed the keyframe interval from 250 frames to 120. We ran test encodes with both the original and final scripts, and the VMAF rating of the video produced by the final script was actually a bit higher: 82.19 versus 81.82.
Using the final script, we were able to produce two simultaneous encoding ladders on the G3.4xlarge instance for a cost per ladder of about 57 cents per hour. In discussing our findings with NVIDIA, we learned that the company offers much more powerful hardware that provides much greater encoding density, which obviously will impact the cost per ladder.
We used the following command script for the medium and veryfast x264 encodes, obviously changing the preset as needed:
ffmpeg -y -re -i input.mp4 -c:v libx264 -preset medium -b:v 5M -bufsize 5M -maxrate 5M -g 120 output.mp4
The NVIDIA-optimized G3.4xlarge computer couldn't produce a single encoding ladder with the x264 codec, even using the veryfast preset. So, we switched to a compute-intensive C5.18xlarge instance, which cost $0.9438 per hour (spot pricing) and produced four simultaneous ladders at 55 fps or higher using the veryfast preset, or a cost per ladder of about 24 cents per hour. Using the medium preset, the system eked out two simultaneous ladders for a cost per ladder of about 47 cents per hour.
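The cost-per-ladder math throughout is simple division: the instance's hourly price over the number of simultaneous ladders it sustained. As a quick sanity check using the spot prices quoted above:

```shell
# Cost per ladder-hour = instance price per hour / simultaneous ladders.
cost_per_ladder() {  # $1 = hourly price in dollars, $2 = ladder count
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", p / n }'
}
cost_per_ladder 1.14   2   # NVIDIA on G3.4xlarge, 2 ladders  -> 0.57
cost_per_ladder 0.9438 4   # C5.18xlarge, x264 veryfast, 4    -> 0.24
cost_per_ladder 0.9438 2   # C5.18xlarge, x264 medium, 2      -> 0.47
```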
We ran two separate sets of tests with Intel Quick Sync, both times using scripts recommended by Intel. The first set, the results of which we presented at Streaming Media East, revealed significant transient quality drops in the Football clip. Intel added the lookahead switches (-look_ahead and -look_ahead_depth) shown in the script below for the second set of tests, which eliminated this problem:
ffmpeg -re -hwaccel qsv -c:v h264_qsv -y -i input.mp4 -filter_scale_threads 4 -c:v h264_qsv -vf hwupload=extra_hw_frames=64,format=qsv -preset 4 -b:v 5M -maxrate 5M -bufsize 5M -g 120 -idr_interval 2 -async_depth 5 -look_ahead 1 -look_ahead_depth 30 output.mp4
You can see the difference in Figure 1, which shows the VMAF scores of the Intel Quick Sync clips produced with the lookahead (in red) and without it (in green), as displayed in the Moscow State University Video Quality Measurement Tool. Each green downward spike represents a very noticeable transient quality drop, all of which the updated encoding parameters with the lookahead eliminated.
Figure 1. VMAF scores for Intel Quick Sync clips with (red) and without (green) lookahead parameters
We encoded with Intel Quick Sync using preset 4. To choose this, we measured the encoding speed and VMAF quality of each preset by encoding the most challenging clip in our test suite (Football) to 1080p at 3Mbps, yielding the data shown in Figure 2. As you can see, presets 3 and 4 present a good balance between speed and quality, although producers seeking to eke out the last bit of encoding speed could justify preset 6 as delivering about 9% better performance with only a minimal quality drop.
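The data behind Figure 2 comes from encoding the same clip once per preset and timing each run. A sketch of that sweep follows for Quick Sync's numeric presets (1 is slowest/highest quality, 7 is fastest); it prints the seven timed commands rather than running them, since execution requires an FFmpeg build with Quick Sync support, and football.mp4 is a placeholder filename:

```shell
# Build one timed 1080p/3Mbps Quick Sync encode command per preset (1..7).
# Pipe $OUT to sh to actually run the sweep on a Quick Sync-capable build.
OUT=$(
  for p in 1 2 3 4 5 6 7; do
    printf 'time ffmpeg -y -hwaccel qsv -c:v h264_qsv -i football.mp4 -c:v h264_qsv -preset %s -b:v 3M -maxrate 3M -bufsize 3M -g 120 out_p%s.mp4\n' "$p" "$p"
  done
)
echo "$OUT"
```

Pairing each run's wall-clock time with its VMAF score yields the speed/quality trade-off plotted in Figure 2.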
Figure 2. Choosing the preset for Intel Quick Sync
Intel created the test station on a cloud system hosted by phoenixNAP that was driven by a single-socket Intel Xeon CPU E3-1585L v5 running at 3.00 GHz, with 4 cores and integrated Intel Iris Pro Graphics that includes Intel Quick Sync encoding and decoding. phoenixNAP doesn’t rent by the hour, but the machine cost was $250 per month, including 15TB of egress data transfer. Best case, if you ran the system 24/7 for a 30-day month, this would translate to about 35 cents per hour.
Interestingly, without the lookahead parameters, the test system could sustain two simultaneous encoding ladders for a cost per ladder of about $0.175 per hour. With the lookahead parameters in the command string, the system could sustain only one encoding ladder at full frame rate, for a cost per ladder of 35 cents per hour. We realize that comparing monthly pricing with spot pricing isn't entirely fair, but that's the data we have.
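The 35-cents figure is just the $250 monthly rate amortized across a 30-day month of 24/7 operation:

```shell
# Amortize a flat monthly machine cost over round-the-clock operation.
hourly_rate() {  # $1 = monthly price in dollars, $2 = days in the month
  awk -v m="$1" -v d="$2" 'BEGIN { printf "%.2f\n", m / (d * 24) }'
}
hourly_rate 250 30   # -> 0.35; halve it for the two-ladder, no-lookahead case
```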
Evaluating the Output
High-volume publishers care about multiple aspects of the output stream, including quality and data rate variability. As we learned from our 2019 NAB Show interview with Twitch’s Yueshi Shen, when you’re pushing hundreds of thousands of streams, even slight variations in data rate can cause delivery issues. Figure 3 shows data rate graphs of the four 3Mbps streams from the Football clip from the Hybrik Cloud platform’s Media Analyzer feature. You see that the top two streams from Intel and NVIDIA show much less variation than the two x264 streams and are also much closer to the targeted data rate.
Figure 3. Data rate variability of the H.264 streams
Table 2 shows various data points regarding the 3Mbps Football stream produced by all four technologies. They demonstrate that the hardware codecs were more accurate and less variable, with Intel Quick Sync holding a slight advantage over NVIDIA via a lower standard deviation and a lower maximum data rate. If data rate variability is an issue for your live events, you should strongly consider a hardware codec.
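If you want Table 2-style variability numbers for your own streams without the Hybrik analyzer, the statistics are straightforward once you have per-second bitrates, which ffprobe's packet sizes can supply. Here the awk stage is fed illustrative kbps values so the sketch stands alone:

```shell
# Mean and (population) standard deviation of per-second bitrates.
# In production, the input numbers would come from ffprobe packet sizes
# bucketed into one-second intervals, e.g.:
#   ffprobe -select_streams v -show_entries packet=size,pts_time -of csv file.mp4
stats() {
  awk '{ n++; s += $1; ss += $1 * $1 }
       END { m = s / n; printf "mean=%.1f stddev=%.1f\n", m, sqrt(ss / n - m * m) }'
}
printf '%s\n' 2900 3100 3000 3050 2950 | stats
```

A lower standard deviation and a maximum closer to the target bitrate correspond to the flatter hardware-codec traces in Figure 3.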
Table 2. Stream variability of the H.264-encoded streams
Figure 4 is the overall VMAF rate-distortion curve for the four measured technologies, showing NVIDIA with a very slight lead over Intel Quick Sync and x264 medium, with x264 veryfast noticeably behind. There were some variations among the individual test clips, with Intel Quick Sync producing the highest quality in the GTAV and Meridian clips and NVIDIA substantially ahead of Intel Quick Sync in the Football clip.
Figure 4. The VMAF rate-distortion curve for the four H.264 codecs