AOMedia Delivers on SVT-AV1's Promise
In August 2020, the Alliance for Open Media (AOMedia) formed a software working group to “use the Scalable Video Technology for AV1 (SVT-AV1) encoder developed by Intel ... to create AV1 encoder implementations that deliver excellent video compression across applications in ways that remove computational complexity trade-offs for an ever-growing video delivery marketplace.” Testing published around that time indicated that SVT-AV1 had quite a hill to climb to stand out among other AV1 codecs.
For example, in a comparison published a month later, I found SVT-AV1 to be last among the four AV1 codecs I tested (AOMedia’s libaom, Visionular’s Aurora1, aomenc, and SVT-AV1), although only about 3% less efficient than FFmpeg/libaom-AV1. At the time, SVT-AV1 had several critical deficits, including a two-pass rate control that was incomplete. In a 2020 report, Moscow State University (MSU) found that SVT-AV1 was four percentage points behind libaom and 25 percentage points behind Aurora1.
With the recent launch of version 1.0, SVT-AV1 appears to have caught up with libaom in quality, with very definite performance advantages. Its two-pass rate control is tested and proven. If you’re creating an AV1 encoding workflow today that emphasises encoding speed and quality, SVT-AV1 should definitely be on your short list.
About Scalable Video Technology
Let’s start with a brief introduction to what Scalable Video Technology (SVT) is and how it works. According to a recent Intel white paper, “The SVT architecture is designed to maximize the performance of an SVT encoder on Intel Xeon Scalable processors. It is based on three-dimensional parallelism.” Most important of the three is segment-based parallelism, which “involves the splitting of each picture into segments and processing multiple segments of a picture in parallel to achieve better utilization of the computational resources with no loss in video quality.”
This technique is counter to the view that encoding each frame in its entirety delivers the best quality. For example, Avidemux’s encoding guide says, “H.264 allows the encoder to segment each frame into several parts. These parts are called ‘slices.’ The advantage of using multiple slices (per frame) is that the slices can be processed independently and in parallel. This allows easy multi-threading implementations in H.264 encoders and decoders. Unfortunately using multiple slices hurts compression efficiency! The more slices are used the worse!”
So, a big part of what SVT attempts to do is split the picture to gain processing efficiency while retaining quality. Early efforts were not encouraging. As shown in Figure 1 from the aforementioned MSU report, SVT-HEVC was 51 percentage points behind x265, and SVT-VP9 was an astonishing 129 percentage points behind VP9, which made the four-percentage-point delta between SVT-AV1 and libaom seem like a breakthrough.
Figure 1. According to Moscow State University in 2020, early SVT performance wasn’t encouraging.
Now that you’re familiar with SVT-AV1, let’s explore the encoding parameters that I used for my testing and the quality comparisons. For the record, I tested version 1.0.0 of SVT-AV1 as provided by a member of the Intel SVT-AV1 development team. I tested FFmpeg version 2022-06-09-git5d5a014199, downloaded from www.gyan.dev. I performed all encoding tests on a 40-core HP Z840 workstation running Windows 7 on two Intel Xeon E5-2687W v3 CPUs running at 3.10 GHz with 32GB of RAM.
Choosing a Preset
Codec developers create presets to configure groups of encoding parameters that control the encoding time/encoding quality trade-off. This allows codec users to choose the level of cost and quality appropriate for their particular application. Whenever you start working with a new codec or encoder, you should benchmark the codec with your own source footage to explore these trade-offs and make the best decision for you. To do this, you should select several representative test clips, encode them using all of the presets and otherwise consistent settings, time the encode, and measure the quality. With FFmpeg, you control the AV1 preset using the -cpu-used switch, with settings ranging from 0 to 8 and a default setting of 1.
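To make that benchmarking loop concrete, here is a minimal Python sketch that builds one FFmpeg/libaom-AV1 command per -cpu-used value and times each encode. It is an illustration only, not the harness I used: it assumes single-pass encoding for brevity, and the function names and output file names are hypothetical.

```python
import subprocess
import time


def build_preset_commands(source, bitrate="1500K", presets=range(0, 9)):
    """Return one single-pass FFmpeg/libaom-AV1 command per -cpu-used value."""
    commands = {}
    for preset in presets:
        output = f"preset_{preset}.mkv"  # hypothetical output naming
        commands[preset] = [
            "ffmpeg", "-y", "-i", source,
            "-c:v", "libaom-av1", "-b:v", bitrate,
            "-cpu-used", str(preset),
            output,
        ]
    return commands


def time_encode(command):
    """Run one encode and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Print the commands only; call time_encode(command) to actually run them.
    for preset, command in build_preset_commands("Football_10.mp4").items():
        print(preset, " ".join(command))
```

Keeping every setting other than the preset identical is what makes the resulting times and quality scores comparable.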
Table 1 shows the average results for two 10-second test clips when encoding with FFmpeg and libaom-AV1. With preset 0, the highest-quality preset, it took an average of 3:24:33 (hour:min:sec) to encode a 10-second test clip (which is why it’s challenging to test with longer clips). With the fastest/lowest-quality preset, it took 1:06 (min:sec). This tells you that on this test bed, FFmpeg/libaom-AV1 isn’t capable of encoding a live stream; in fact, even the fastest preset takes close to seven times longer than real time.
Table 1. Encoding time and quality with FFmpeg and libaom-AV1
For a measure of overall quality, I used Video Multimethod Assessment Fusion (VMAF) computed via the harmonic mean method. To assess transient quality, I used low-frame VMAF, which is the lowest VMAF score for any frame in the test file.
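As a rough illustration of those two metrics, the following Python sketch computes the harmonic mean and the low-frame score from a list of per-frame VMAF values. It assumes you have already extracted the per-frame scores from your VMAF tool; the function name is hypothetical.

```python
from statistics import harmonic_mean


def summarize_vmaf(frame_scores):
    """Return the harmonic-mean VMAF and the lowest per-frame VMAF."""
    return {
        "harmonic_mean": harmonic_mean(frame_scores),
        "low_frame": min(frame_scores),
    }
```

Because the harmonic mean is pulled down by low values more than the arithmetic mean is, a few poor frames lower the overall score noticeably.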
In the Delta row on the bottom of Table 1, the time delta divides the slowest score by the fastest and shows that the slowest took 185.95 times longer than the fastest. You can also see that the overall VMAF difference between the fastest and slowest preset is 3.77. For perspective, Netflix has stated that a difference of 6 represents a just noticeable difference (JND) that viewers will perceive, although other researchers have found that 3 VMAF points constitutes a JND. Either way, it’s not a significant difference between the highest- and lowest-quality preset—particularly, as you will see, compared to SVT-AV1.
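The time delta is simple arithmetic on the two encoding times quoted above. This short Python sketch reproduces it, along with the real-time multiple for the fastest preset on a 10-second clip; the helper name is hypothetical.

```python
def to_seconds(clock):
    """Convert 'h:mm:ss' or 'm:ss' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


slowest = to_seconds("3:24:33")  # preset 0
fastest = to_seconds("1:06")     # preset 8

time_delta = slowest / fastest
realtime_multiple = fastest / 10  # the clips are 10 seconds long

print(round(time_delta, 2))  # 185.95
print(realtime_multiple)     # 6.6
```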
To visualise the encoding time/quality tradeoff, I plotted three factors—time, VMAF, and low-frame VMAF—for each preset on a scale from 0 (fastest preset/lowest quality) to 100 (slowest preset/highest quality). You can see this in Figure 2.
Figure 2. Plotting encoding time versus quality for libaom-AV1
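If you want to build a similar chart from your own results, one plausible way to put each factor on a 0 to 100 scale is min-max scaling, sketched below in Python. This is an assumption about the scaling, not a description of the exact spreadsheet behind the figure, and the sample values are invented for illustration.

```python
def scale_0_100(values):
    """Min-max scale a list so the smallest value maps to 0 and the largest to 100."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [100.0 * (v - low) / (high - low) for v in values]


# Invented sample values, ordered from fastest to slowest preset.
encode_seconds = [66, 120, 264]
vmaf = [90.0, 93.0, 94.0]

print(scale_0_100(encode_seconds))
print(scale_0_100(vmaf))
```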
Every application is different, and every producer dances to their own particular tune. With my fictional VOD content producer hat on, I see preset 4 as the logical starting point, with a substantial jump in both VMAF and low-frame VMAF. Do I increase encoding costs by roughly 50% to achieve a 0.4 VMAF improvement with preset 3? Probably not. Unless I’m shipping extremely high stream volumes, I don’t consider presets 2, 1, or 0.
As an aside, although we are only looking at encoding time and quality in this analysis, a third factor, bandwidth, is also in play. That is, all publishers should start this analysis with a target quality level they will achieve by choosing a preset and bitrate. With preset 2, the bitrate necessary to achieve that target quality level will be less than for preset 3, so bandwidth savings will increasingly offset the encoding time costs as viewing volume increases.
At relatively low volumes, choosing a faster preset and saving on encoding time is probably the best strategy. If your streams will be viewed hundreds of thousands of times or more, it might make sense to pay more for encoding and save bandwidth. (I explore these issues in an article titled Choosing an x265 Preset—An ROI Analysis.) For most producers, I would assume that preset 4 or preset 3 is the most relevant choice for FFmpeg/libaom-AV1.
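To get a feel for where that crossover sits, here is a minimal break-even sketch in Python. It assumes a very simple model in which the only costs are a one-time extra encoding cost and a per-gigabyte delivery cost; the function name and every input value are hypothetical.

```python
def break_even_views(extra_encode_cost, bitrate_saving_kbps,
                     duration_seconds, cost_per_gb):
    """Views needed before bandwidth savings repay the extra encoding cost."""
    bytes_saved = bitrate_saving_kbps * 1000 / 8 * duration_seconds
    saving_per_view = bytes_saved / 1e9 * cost_per_gb
    return extra_encode_cost / saving_per_view


# Hypothetical inputs: $10 more to encode a one-hour video with a slower
# preset, 100 kbps saved at the same quality, $0.05 per GB delivered.
print(round(break_even_views(10.0, 100, 3600, 0.05)))
```

Below the break-even view count, the faster preset is cheaper overall; above it, the slower preset pays for itself.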
Choosing a Preset: SVT-AV1
Now let’s look at SVT-AV1. Table 2 shows the same datapoints for SVT-AV1 presets 0–12, with an actual range of -2–13 and a default of 10. The results reveal several obvious points.
Table 2. Encoding time and quality with SVT-AV1
First, the ranges of encoding time, VMAF, and low-frame VMAF are much, much greater. In particular, three presets (10, 11, and 12) are capable of real-time encoding, with preset 9 very close, although the quality disparity is significant, extending to 2 JND by Netflix’s numbers and close to 3 JND for low frame.
Figure 3 charts the encoding time/quality trade-off. From a VOD perspective, it appears that preset 6 is the starting point, with most producers choosing somewhere between 2 and 4. As previously detailed, as the anticipated view counts for your videos increase, you should gravitate toward a higher-quality preset.
Figure 3. Plotting encoding time versus quality for SVT-AV1
In terms of the bigger picture, the range of performance and quality makes SVT-AV1 much more usable than libaom-AV1, enabling even live AV1 applications. I don’t know what configuration options are available within libaom-AV1, but it would be helpful if its developers explored ways to broaden the spread of encoding time and quality to make this codec as flexible as SVT-AV1.
Choosing the Thread Count
Now that we’ve selected a preset, let’s cover threads. This analysis will help you understand which thread count to include in your command string and help you choose the optimal cloud instance or encoding strategy on a multicore computer.
With FFmpeg/libaom-AV1, you control the number of CPU threads applied to the encode with the -threads command. Table 3 shows the analysis that I go through when attempting to identify the optimal setting for any configuration option. The baseline column shows the result when no setting is in the command string, which invokes the default setting. Each subsequent column shows the results from configuring the otherwise identical command string to use one, two, four, eight, 16, and 32 threads on the 40-core HP workstation. The Delta column shows the difference between the highest and lowest scores.
Table 3. Finding the optimal thread setting for FFmpeg/libaom-AV1
You can see the results in encoding speed, bitrate, and three quality variables—harmonic mean VMAF, low-frame VMAF, and standard deviation—the last being a measure of quality variability in the stream. The green background identifies the best score, the yellow background the worst.
In terms of performance, not surprisingly, we see that one thread is the slowest option by far. We also see that while 16 threads is the fastest setting, the performance difference between 16 and eight/32 is negligible. From this, I’d guess that the maximum number of threads libaom-AV1 can utilise is eight.
Surprisingly, the single-threaded encode was the lowest quality in all three measures, although the Delta column shows that the differences are irrelevant. The quality results for almost all other alternatives are identical, so production efficiency should be the focus. Clearly, any setting over eight threads makes no sense, and if you’re provisioning cloud instances, eight should be the maximum as well. But is eight the optimal thread count? Table 4 tells the tale.
Table 4. Finding the optimal thread count for an encoding workstation for FFmpeg/libaom-AV1
Using the average encoding times shown in Table 3, Table 4 computes the number of hours it would take to encode an hour of AV1 video using each thread count. Then, it adds the hourly cost of Amazon Web Services (AWS) compute instances from go2sm.com/awspricing and computes the cost per hour for the four thread counts shown.
Interestingly, you achieve the cheapest cost per hour using a single-threaded machine. Why would this be? Because as shown in Figure 4, the encoding cost increases linearly, while the additional threads deliver increasingly lower speed increases. Going from one thread to two doubles the cost but only increases encoding speed by 1.8x. Going from one thread to eight increases costs by 8x but only increases throughput by 2.99x.
Figure 4. Plotting the increase in encoding speed versus instance cost
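The arithmetic behind that conclusion can be sketched in a few lines of Python. It assumes, as described above, that instance cost scales linearly with thread count, and it uses the speed multiples quoted in the text; the function name is hypothetical.

```python
def relative_cost_per_hour(cost_multiple, speed_multiple):
    """Cost to encode an hour of video relative to the single-thread baseline."""
    return cost_multiple / speed_multiple


# (threads, speed relative to one thread); cost multiple equals thread count.
for threads, speed in [(1, 1.0), (2, 1.8), (8, 2.99)]:
    print(threads, round(relative_cost_per_hour(threads, speed), 2))
```

On these numbers, two threads cost about 1.11x as much per encoded hour as one thread, and eight threads cost about 2.68x as much.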
Of course, this analysis assumes that the work involved in provisioning and managing many more encoding stations doesn’t outweigh the cost savings. Either way, provisioning encoding stations with more than eight cores likely doesn’t make economic sense, and lower thread counts might be more cost-efficient.
Working Efficiently on Multicore Encoding Stations
The same logic should apply to spreading production encodes over a multiple-core workstation. On a 16-core workstation, for example, you might achieve faster throughput with four encodes using four threads each as opposed to two encodes using eight threads. Of course, running multiple encodes adds some overhead that slows overall operation. For example, on my 40-core workstation, a single encode of the 10-second Football test clip took 4:23 (min:sec). When I encoded eight files simultaneously, the average time increased to 5:49, about 32% higher. Still, if you have the ability to deploy multiple instances on a single workstation, some experiments with different thread values will provide useful direction.
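A quick Python calculation shows why the slower per-file time can still be a net win. It uses the two times quoted above and assumes both runs used otherwise identical settings; the function name is hypothetical.

```python
def throughput_gain(single_seconds, parallel_seconds, parallel_jobs):
    """Files finished per unit time when running in parallel, relative to one at a time."""
    single_rate = 1 / single_seconds
    parallel_rate = parallel_jobs / parallel_seconds
    return parallel_rate / single_rate


single = 4 * 60 + 23    # 4:23 for one encode
parallel = 5 * 60 + 49  # 5:49 average with eight simultaneous encodes

print(round(throughput_gain(single, parallel, 8), 2))
```

Each file takes longer, but eight files finish in 5:49 rather than one file in 4:23, which works out to roughly six times the throughput.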
Choosing the Optimal Thread Count With SVT-AV1
Given the previously shared explanation of SVT, you’d expect better performance at higher thread counts, and SVT delivers. Still, as you’ll see, the same analysis does less to sell multiple-core Xeon processors than you might think.
Table 5 shows the encoding speed/quality trade-off associated with SVT-AV1’s --lp switch, which controls the number of logical processors assigned to any encoding task. Baseline is fastest because it appears to assign all logical processors to the task, although baseline is only slightly faster than 32 threads.
Table 5. Finding the optimal thread setting for SVT-AV1
From a quality perspective, a single thread delivers the best quality here, but the delta is irrelevant. This makes encoding throughput and cost the most important factors in choosing the thread count (and --lp value). In this regard, the surprisingly diminishing speed returns from the additional threads dictate the results shown in Table 6.
Table 6. Finding the optimal thread count for an encoding workstation for SVT-AV1
As you can see in Table 6, the jump from one thread to eight threads delivers slightly more in throughput than AWS charges for CPUs, making eight threads the cheapest encoding option by a hair. From there, however, the lessened speed increase means an ever-escalating cost per hour for higher thread counts. These findings suggest that encoding configurations exceeding eight threads might not be cost-effective.
These results come with all of the usual caveats; your findings may certainly vary. I performed these tests on 1080p 8-bit content, and the results for 4K and 8K HDR footage might be completely different. I’m also predicting cloud throughput from results posted by an older desktop machine; results on newer versions may be different. Intel versus AMD is another potential differentiator.
The high-level point is that with both libaom-AV1 and SVT-AV1, you shouldn’t assume that more cores deliver the most cost-effective throughput. If you’re getting ready to scale up your AV1 encoding and you need to figure out which workstations to buy or which cloud instances to provision, a day or two of this kind of testing with your sample footage and target output should provide very clear direction.
This takes us to our quality bake-off.
Here’s the encoding string that I used for FFmpeg/libaom-AV1, with options in green as the defaults. This means that you’d get the same result if you removed them. I like to leave them in because it simplifies comparing the string to those used in other comparisons.
ffmpeg -y -i Football_10.mp4 -c:v libaom-av1 -b:v 1500K -g 60 -keyint_min 60 -cpu-used 8 -auto-alt-ref 1 -threads 8 -tile-columns 1 -tile-rows 0 -row-mt 1 -lag-in-frames 25 -pass 1 -f matroska NUL & \
ffmpeg -y -i Football_10.mp4 -c:v libaom-av1 -b:v 1500K -maxrate 3000K -g 60 -keyint_min 60 -cpu-used 4 -auto-alt-ref 1 -threads 8 -tile-columns 1 -tile-rows 0 -row-mt 1 -lag-in-frames 25 -pass 2 Football_1.mkv
Note that I tested with -cpu-used 8 in the first pass and -cpu-used 4 in the second. That’s because the preset used in the first pass doesn’t impact overall quality. I tested with threads set to 8 for maximum single-encoding-instance throughput on my workstation.
Here’s the command string used for SVT-AV1. For these tests, I wanted to get as close to the same encoding time for both codecs as possible.
SvtAv1EncApp -i input.y4m --rc 1 --tbr 1500 --mbr 3000 --keyint 2s --preset 3 --passes 3 --lp 8 --tile-columns 0 --tile-rows 0 --enable-tf 1 -b
With -cpu-used 4 in the second pass, FFmpeg delivered the files in 4:24 (min:sec; see Table 1). I used preset 3 for SVT-AV1, as it delivered the files in a slightly faster 3:48 (see Table 2). Note that I used three-pass encoding to encode all SVT-AV1 output produced for this article, although the first and second passes are very, very fast. I also used --lp 8 for throughput and to match the libaom setting.
Overall, I tested 17 files ranging in duration from 1 to 4 minutes, with four encodes each to produce output to present in a rate distortion graph and to use to compute BD-Rate results. I’m told that adding the results to present a composite graph is mathematically incorrect, but I find it useful as a general gauge of the overall result. So please don’t show Figure 5 to your mathematically inclined colleagues.
Figure 5. Average results for 17 test files
As you can see in Figure 5, SVT-AV1 wins at lower bitrates, while libaom prevails at higher bitrates. Overall, according to the BD-Rate composite computation, SVT-AV1 produced the same quality as libaom-AV1, with a bitrate savings of 1.36%.
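For readers unfamiliar with BD-Rate, the idea is to average the difference in log-bitrate between two rate-distortion curves over the quality range they share. The Python sketch below illustrates that idea with piecewise-linear interpolation, which is a simplification: standard BD-Rate tools typically fit a polynomial or spline to the points, so this will not reproduce their figures exactly. The function names and sample points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Linearly interpolate y at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the curve")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Average log-bitrate over the quality interval [lo, hi] (trapezoid rule)."""
    pts = sorted(points, key=lambda p: p[1])
    qualities = [q for _, q in pts]
    log_rates = [math.log(r) for r, _ in pts]
    total = 0.0
    prev = _interp(lo, qualities, log_rates)
    for i in range(1, steps + 1):
        q = min(hi, lo + (hi - lo) * i / steps)
        cur = _interp(q, qualities, log_rates)
        total += (prev + cur) / 2
        prev = cur
    return total / steps


def bd_rate(reference, test):
    """Percent bitrate change of `test` vs `reference` at equal quality.

    Each curve is a list of (bitrate, quality) points. Negative means savings.
    """
    lo = max(min(q for _, q in reference), min(q for _, q in test))
    hi = min(max(q for _, q in reference), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(reference, lo, hi)
    return (math.exp(diff) - 1) * 100


# Invented points: the test codec needs 10% less bitrate at every quality level.
reference = [(1000, 85.0), (2000, 90.0), (3000, 93.0), (4000, 95.0)]
test = [(900, 85.0), (1800, 90.0), (2700, 93.0), (3600, 95.0)]
print(round(bd_rate(reference, test), 2))
```

A negative result means the test codec needs less bitrate than the reference to reach the same quality.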
Feeling a bit let down because you read all the way to the end, only to find that SVT-AV1 delivered just a minuscule bandwidth savings? Well, when I last reviewed SVT-AV1, the codec needed to increase bandwidth by 4% to match libaom-AV1 quality and was actually slower as tested.
Now, SVT-AV1 slightly exceeds libaom-AV1 quality while enabling software-based live AV1 encoding. Not bad for version 1.0. While this may not trigger a mass exodus from libaom-AV1 to SVT-AV1, it does enable a completely different set of potential AV1 applications, which can only accelerate AV1 adoption.
During my tests, I had to convert the source MP4 files to Y4M format to encode with the SVT-AV1 standalone encoder. Obviously, operation within FFmpeg would eliminate this and simplify integrating SVT-AV1 encoding into existing FFmpeg-based workflows.
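For reference, the conversion step can be scripted. The Python sketch below builds an FFmpeg command that writes a Y4M file, which FFmpeg selects from the .y4m extension. It assumes 8-bit 4:2:0 source footage (hence yuv420p), and the function names are hypothetical; it is not necessarily the exact command I ran.

```python
import subprocess


def build_y4m_command(source, output):
    """Build an FFmpeg command that decodes `source` into a Y4M file."""
    return ["ffmpeg", "-y", "-i", source, "-pix_fmt", "yuv420p", output]


def convert_to_y4m(source, output):
    """Run the conversion; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_y4m_command(source, output), check=True)


if __name__ == "__main__":
    print(" ".join(build_y4m_command("Football_10.mp4", "Football_10.y4m")))
```

Keep in mind that Y4M is uncompressed, so the intermediate files are far larger than the source MP4s.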
While you can access SVT-AV1 within some FFmpeg builds, it’s single pass only, not the two- or three-pass encoding that delivers better bitrate control and quality. Apparently, adding the three-pass capability to FFmpeg is a lot of work that probably won’t happen at least until the end of 2022. So, most developers will use the SVT-AV1 app that I used.
Another open question is the continued vitality of the libaom-AV1 codec in FFmpeg, given that AOMedia has focused its software working group on SVT-AV1. I sent a question to a contact at AOMedia about its plans to keep updating libaom-AV1 and its own standalone encoder (aomenc), but hadn’t heard back by press time. Check the website for any updates.
Looking at prominent AV1 publishers, YouTube has been producing AV1 with FFmpeg/libaom-AV1 for years. Since switching over to SVT-AV1 in YouTube’s FFmpeg-based encoding farm would require significant resources for modest gains for VOD production, it seems likely that AOMedia will continue to support libaom-AV1 (and its largest user) at least until full use of SVT-AV1, including three-pass encoding, is available within FFmpeg—and probably a whole lot longer.