March 20, 2019
By Jan Ozer Contributing Editor
From the Spring 2019 Industry Sourcebook issue of Streaming Media magazine.

Buyers' Guide to Video Quality Metrics

Video quality metrics are algorithms designed to predict how actual viewers would gauge video quality. These metrics are used for a range of activities, from comparing codecs and different encoding configurations, to assisting in production and live quality of experience (QoE) monitoring. In this buyer's guide, I'll identify and describe the most commonly used objective quality metrics and discuss tools that deploy them, though primarily for comparing codecs and encoding configurations, rather than QoE or production monitoring.

Understanding Video Quality Metrics

Table 1 provides a taxonomy of the most commonly used video quality metrics. On the extreme left is subjective Mean Opinion Score (MOS) computed with actual viewers rating videos on a scale from 1 to 5. While not a metric calculated by a computer like the others, subjective MOS is the gold standard because it provides the best predictive value of actual subjective ratings. All other ratings are much less accurate, though they improve as they move to the right.

Note that not all metrics expressed as mean opinion scores are subjective; several video quality measurement tools output objective ratings using the five-point MOS scale. The fact that these ratings are computer generated should be obvious as soon as you glance at the product literature, but I wanted to make this distinction to avoid any confusion.

Table 1. Common video quality metrics and associated features and interpretations.

Working down the features on the left, Basis is the theoretical basis of the metric (ML stands for machine learning). PSNR (peak signal-to-noise ratio) and SSIM (Structural Similarity) are based solely upon mathematical algorithms. This makes them static, which means that they don't improve over time. According to SSIMplus developer SSIMWave, machine learning has contributed to the development of the SSIMplus algorithm, so it has improved over time and will continue to do so. In contrast to PSNR and SSIM, VMAF (Video Multimethod Assessment Fusion) is based upon algorithms augmented by machine learning, so it improves over time and can be trained for better predictive value on content-specific datasets, like animated videos for a cartoon-based channel or sports videos for a sports channel. This largely explains the predictive value rating atop the table.
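
To make "based solely upon mathematical algorithms" concrete, here is a minimal Python sketch of the standard PSNR formula for 8-bit samples, PSNR = 10 * log10(255^2 / MSE). The function name and the tiny four-pixel "frames" are illustrative assumptions, not taken from any tool in this guide; real tools work on full decoded frames and average across the video.

```python
import math

def psnr(reference, encoded, max_value=255):
    """Return PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(encoded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between corresponding samples.
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel "frames" used only to exercise the formula.
source_pixels = [52, 55, 61, 66]
encoded_pixels = [54, 55, 60, 65]
print(round(psnr(source_pixels, encoded_pixels), 2))
```

Nothing in the calculation learns from viewer data, which is why the same inputs always yield the same score.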

All the objective metrics in the table are “reference” metrics, meaning that they are computed by comparing the encoded file to the source. Typically, MOS trials don't involve viewing the source video file, but some do. Generally, reference-based metrics are more accurate than non-reference metrics, which are used in applications where reference-based metrics are either impossible (no access to the source) or impractical (live encoding).

Scoring is the scale used by the video quality metric, which varies by metric. The next few features detail how to use and interpret the metric and also represent the utility of the metric for different applications, which I'll elaborate on in a moment. For example, the “no artifact” threshold is the score at which the video is presumed to be free from disturbing artifacts, with “artifacts likely present” the score at which you'd expect the video to start looking ugly. In lieu of this, you can simply use the rating system shown which details scores associated with excellent, good, fair, poor, and bad quality video.

One feature that makes VMAF particularly useful is that a six-point differential constitutes a Just Noticeable Difference (JND), which is generally accepted to mean a difference noticeable by 75% of viewers. If the VMAF ratings of two codecs differ by 2 points, the difference is presumed unnoticeable; while higher is always better, most viewers simply wouldn't notice.
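
As a rough illustration of how that rule of thumb might be applied when comparing two encodes, the Python sketch below flags a difference only when it reaches the six-point differential cited above. The function name and the sample scores are hypothetical.

```python
VMAF_JND_POINTS = 6.0  # differential cited in this article

def likely_noticeable(vmaf_a, vmaf_b, jnd=VMAF_JND_POINTS):
    """Return True if two VMAF scores differ by at least one JND."""
    return abs(vmaf_a - vmaf_b) >= jnd

# Hypothetical scores for two encodes of the same source.
print(likely_noticeable(93.0, 91.0))  # False: 2-point gap
print(likely_noticeable(93.0, 86.5))  # True: 6.5-point gap
```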

Device ratings is the ability to rate video quality for a particular device, reflecting the reality that video that looks great on an iPhone might look awful on a 4K TV. SSIMplus leads in this rating with dozens of device ratings. VMAF has three ratings: standard, phone, and the recently launched 4K rating.

The final characteristic is ownership. All of the metrics except SSIMplus are open-source metrics, which means they are available on a variety of tools, including some free tools like FFmpeg.
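
As one example of that availability, the Python sketch below builds FFmpeg command lines using FFmpeg's `psnr`, `ssim`, and `libvmaf` filters. The helper name and file names are hypothetical, and some assumptions apply: the `libvmaf` filter is only present when FFmpeg was built with libvmaf support, and both inputs are assumed to share the same resolution and frame count.

```python
import shlex

def ffmpeg_metric_command(encoded, source, metric="psnr"):
    """Build an FFmpeg argument list that scores `encoded` against `source`."""
    # Map friendly metric names to FFmpeg filter names.
    filters = {"psnr": "psnr", "ssim": "ssim", "vmaf": "libvmaf"}
    if metric not in filters:
        raise ValueError(f"unsupported metric: {metric}")
    # First input is the encoded file, second is the reference;
    # "-f null -" discards the output so only the score is reported.
    return ["ffmpeg", "-i", encoded, "-i", source,
            "-lavfi", filters[metric], "-f", "null", "-"]

cmd = ffmpeg_metric_command("encoded.mp4", "source.mp4", "ssim")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` would execute the comparison, with FFmpeg reporting the score in its console output.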

Metric Summary

Here's a brief summary of each metric.

PSNR—Probably the most widely used metric, but also recognized as having the lowest predictive value. Still cited by Netflix, Facebook, and other companies in codec comparisons and similar applications but usage is declining.

SSIM—Slightly higher predictive value than PSNR, and less well-known, but favored by some codec researchers and compression engineers. Usage is declining.

SSIMplus—Very functional and highly regarded metric, but proprietary, so not available in tools other than those from SSIMWave, which start at around $995.

VMAF—Invented by Netflix and then open-sourced, VMAF is widely available. Designed and tuned for evaluating streams encoded for the multiple-resolution rungs on an encoding ladder, VMAF is the engine behind Netflix's highly respected per-title and per-clip encoding stacks. Very functional and an up-and-comer.

Tools

Multiple products can compute these metrics; Table 2 shows five that I use. These are not the only products out there, but they are the ones I'm most familiar with. If you have a product in this category that's not listed, please contact me at janozer@gmail.com to discuss.

Note that there's an entirely separate class of products with live quality checking from a variety of companies, including SSIMWave (SSIMPlus Live Monitor), Telestream (Inspector Live), and Tektronix (Sentry), that measure and monitor video quality to ensure ongoing QoE. These are not the focus of this buyer's guide.

Table 2. Tools for computing video quality metrics.

In terms of a high-level taxonomy for the tools in Table 2, FFmpeg is included as a free way to generate test scores, but it lacks features like visualizations that are essential to understanding metrics. To explain, when you use objective metrics, you care about at least two scores: the average value and the lowest frame value.

Why? If you compare video encoded with CBR and VBR, the overall scores are usually relatively close. However, the CBR video may have transient patches where quality drops sufficiently to impair the quality of experience. FFmpeg provides a single score with no visibility into the lowest score, so you're in the dark. In contrast, the Moscow University Video Quality Measurement Tool (VQMT) allows you to output any number of “bad frames” to identify problems, while all the others provide visualizations that allow you to view how the values change over the duration of the video, which VQMT does as well.
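
To show why the average alone can mislead, here is a minimal Python sketch that reports the mean, the lowest frame score, and the worst few frames from a list of per-frame scores. The function name and the sample scores are hypothetical, and how you obtain per-frame scores depends on the tool you use.

```python
def summarize_scores(frame_scores, worst_n=3):
    """Return (mean, minimum, worst frames) for a list of per-frame scores."""
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    mean_score = sum(frame_scores) / len(frame_scores)
    # Pair each score with its frame index, lowest scores first.
    ranked = sorted(enumerate(frame_scores), key=lambda pair: pair[1])
    return mean_score, min(frame_scores), ranked[:worst_n]

# Hypothetical clips with the same average; only one has a transient drop.
vbr_scores = [90, 90, 90, 90, 90, 90]
cbr_scores = [95, 96, 60, 97, 96, 96]

for name, scores in (("VBR", vbr_scores), ("CBR", cbr_scores)):
    mean_score, low, worst = summarize_scores(scores, worst_n=2)
    print(name, mean_score, low, worst)
```

Both clips average 90, but only the minimum and the worst-frame list expose the drop at frame index 2 in the second clip.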

You see this in Figure 1 from the SSIMWave VOD Monitor, which is tracking the quality of multiple videos encoded using different per-title encoding technologies. As you can see, several videos have regions of significantly reduced quality. In the SSIMWave tool, you can click anywhere along any of the graphed values and view that frame from any of the test videos. This lets you identify potential problem areas and verify whether a problem truly exists. The bottom line is that if you're serious about your file comparisons, you need a tool with a GUI and visualizations.

Figure 1. SSIMWave's VOD Monitor tool offers very flexible result visualization.

Beyond FFmpeg, the tools break into roughly three categories. VQMT and VideoQuest are single-user desktop tools used primarily for experimentation, while SSIMWave VOD Monitor is a CentOS-based multi-user experimentation and production monitor. Hybrik is our only cloud entry, which means virtually unlimited high-volume production. Working through the features in Table 2 will help make these distinctions clear, and I'll summarize each product at the end of this discussion.

In terms of operational paradigm, VQMT and VideoQuest can compare up to two encoded files to a single source in the GUI, with similar command line operation. Both the SSIMWave and Hybrik tools can compare multiple files to a single source in the GUI, speeding operation, and with Hybrik you can download a CSV with the results from multiple files, simplifying import and analysis. In contrast, with VQMT, VideoQuest, and the VOD Monitor, you have to copy and paste single scores from single CSV files, which is boring, time consuming, and error prone.

In addition, VQMT and VideoQuest run on the specified operating systems and are primarily single-person tools. In contrast, SSIMWave VOD Monitor runs on CentOS with a browser-based GUI that can be accessed by any computer that can reach the CentOS computer, while Hybrik is a SaaS web application that anyone with a browser and connectivity can access. While VQMT and VideoQuest both offer batch operation, SSIMWave and Hybrik can be driven by a REST API, so results can easily be integrated into a production environment.

Cross-resolution refers to the ability to compare multiple resolutions of a file with a single source, which is common when you measure the quality of all files in an encoding ladder. With VQMT and VideoQuest, you first must convert the lower-resolution files to YUV files at the same resolution as the source, which is time consuming and requires lots of disk space. In contrast, SSIMWave and Hybrik don't require this, so you can compare a 360p file to the 1080p source without prior conversion. SSIMWave takes this one step further with the ability to compare files with different frame rates, like 30 fps 720p versions of a file with a 1080p60 source.
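
To put a number on "lots of disk space," the Python sketch below estimates the size of an uncompressed file, assuming planar 4:2:0 YUV, which stores 1.5 samples per pixel. The function name and the one-minute 1080p30 example are illustrative assumptions, not figures from any of the tools.

```python
def raw_yuv420p_bytes(width, height, fps, seconds, bit_depth=8):
    """Estimate the size in bytes of uncompressed 4:2:0 YUV video."""
    # 8-bit samples take one byte; higher bit depths are stored in two.
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    # One luma sample per pixel plus two quarter-size chroma planes.
    samples_per_frame = width * height * 3 // 2
    frames = round(fps * seconds)
    return samples_per_frame * bytes_per_sample * frames

# One minute of 8-bit 1080p video at 30 fps.
size = raw_yuv420p_bytes(1920, 1080, 30, 60)
print(f"{size / 1e9:.2f} GB")  # 5.60 GB
```

Because every rung is upscaled to the source resolution before comparison, each rung of the ladder costs the same amount of space, however small its encoded file is.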

VideoQuest and SSIMWave can also automatically align source and encoded versions of files, which becomes essential when an encoder adds or drops a frame at the start of the video, a frustratingly frequent event. With VQMT, you have to adjust this manually; with Hybrik it's not possible at all.

Metrics are self-explanatory, and we discussed the ability to output the “bad frames” above. File-related information relates to other, non-metric related data that a tool can glean from a file. As an example, one particular strength of Elecard VideoQuest is the ability to reveal the frame and file information shown in Figure 2. Specifically, it's great to be able to view the frame types in the GOP on the bottom and know that the frames compared are both P frames about the same size. Other screens not shown reveal even more comparative data relating to the test files, allowing a deep comparison and analysis not possible in any of the other tools.

Figure 2. VideoQuest displays lots of significant file-related data.

We covered results visualization screens and device ratings above. Outputs refers to how the scoring information is delivered after the analysis is complete. Single-file output means 20 different open, copy, and paste operations to record 20 different scores. In this regard, with Hybrik, you can output the results from an unlimited number of files into a single CSV, which can save hours of work in complicated analyses.

I would synthesize the features table data into the following product-related observations.

FFmpeg—Only for those who can't afford the other tools.

VQMT—Easy to use and fast with a great visualization tool, an extensive selection of metrics and an excellent command line utility. If you're looking to pump out multiple file comparisons, VQMT is a great tool for it.

VideoQuest—Trails VQMT in usability but offers exceptional file-related data and a great ability for viewing and comparing videos. A wonderful tool for deep file comparisons.

SSIMWave VOD Monitor—The only tool with the SSIMplus metric but lacks support for VMAF. Outstanding multiple file visualization, industry-leading device support, and great high-volume and multiple-user functionality. The VOD Monitor also has unique metrics, like the Perceptual Fidelity metric, which excludes visual deficits in the source video to score the encoded results, and the Weighted Average Index, which lets you factor quality variations in the video into a final rating. Overall, a deep, powerful, and highly usable tool.

Hybrik—Video analysis is a feature of the Hybrik cloud encoding platform, not a separate product, which is frustrating because it's too expensive for most users to purchase for analysis-only functionality. Hybrik offers a solid range of metrics and QC tools (see Figure 3), while cloud operation allows Hybrik to plow through many more files than you could process on a single computer. For high-volume processing of longer and/or higher-resolution files, Hybrik can't be beat.

Figure 3. Hybrik's metrics and QC tools.

Note that VQMT and VideoQuest both offer trial versions, and you can request one from SSIMWave. I recommend this as the first step for any of the tools that interest you.

[This article first appeared in the 2019 Streaming Media European Industry Sourcebook.]