
Assuring Video Quality
When we're at the point where all encoders work just as well, how do we measure quality? From PSNR to SSIM to ‘golden eyes,’ here's a look at how video quality can be assessed.

Objective, Subjective, or Both?

It’s a tall order to cover all those areas in a short how-to article such as this one, so we’ll focus on the basic measurement of the content quality itself, after it has been transcoded. 

During our testing, we opted to use a quality-testing methodology for the encoded or transcoded content, prior to transmission, that includes two basic factors—objective data and picture metrics—as well as subjective testing.

To achieve some form of consistency, we needed to decide on which measurements to take. We opted for one objective measurement—peak signal to noise ratio (PSNR)—and one subjective measurement—the use of “golden eyes” for blind subjective ranking. 

PSNR is an index ratio that has been applied for years to still images; it compares the output file against the source file and is expressed in decibels, based on the measured difference between the two.
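The standard definition, for 8-bit samples with a peak value of 255, derives PSNR from the mean squared error between source and output. A minimal sketch (function and parameter names here are illustrative, not from any particular tool):

```python
import math

def psnr(source, output, peak=255.0):
    """PSNR in decibels between two equal-length sample sequences.

    Illustrative sketch only: real measurement tools operate on decoded
    frames in the native color space, usually per plane (Y, U, V).
    """
    if len(source) != len(output):
        raise ValueError("sample counts must match")
    # Mean squared error between corresponding samples
    mse = sum((s - o) ** 2 for s, o in zip(source, output)) / len(source)
    if mse == 0:
        return float("inf")  # identical content: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)
```

Higher values mean the output is closer to the source; identical files produce an unbounded (infinite) ratio, which is why tools usually report a capped or "identical" result in that case.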

According to Winkler and Mohandas, one primary reason that PSNR is used for after-the-fact file-based quality comparisons, or during live encodes, is the speed at which PSNR can be computed. Another reason is the familiarity that researchers have with PSNR from their days at college, as many of the Ph.D.s in image compression started with basic still-image formats such as JPEG.

“Over the years, video researchers have developed a familiarity with PSNR that allows them to interpret the values immediately,” they write. “There is probably no other metric as widely recognized as PSNR, which is also due to the lack of alternative standards.”

For proper video PSNR testing, however, the test needs to be repeated for every single frame, yielding a massive amount of data about each frame that is, more often than not, only viewed in terms of a minimum, maximum, and average rating. In a practical sense, this means that the overall video gets a PSNR rating, but it doesn’t address problems within the video itself.
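That per-frame repetition, and the way the results are usually collapsed, can be sketched as follows (an illustrative sketch, assuming each frame is a flat list of 8-bit samples; the names are hypothetical):

```python
import math

def frame_psnr(ref, out, peak=255.0):
    """PSNR in decibels for a single pair of frames."""
    mse = sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def summarize(ref_frames, out_frames):
    """Collapse per-frame PSNR into the min/max/average summary most
    reports give -- which hides WHERE in the video the worst frames are,
    even though the per-frame list still contains that information."""
    scores = [frame_psnr(r, o) for r, o in zip(ref_frames, out_frames)]
    return {"min": min(scores), "max": max(scores),
            "avg": sum(scores) / len(scores), "per_frame": scores}
```

The per-frame list is the "massive amount of data" in question; the three summary numbers are what most reports actually surface.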

In addition, neither PSNR nor alternative standards, such as SSIM (structural similarity index measurement), take into account the basis of modern-day video compression. Compression across multiple frames is used to address bandwidth constraints with newer codecs such as H.264, where the codec jettisons information in certain parts of multiple frames if that content doesn’t perceptibly change for several frames before or after the given reference frame (an I-frame) that is being compressed. 
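For reference, SSIM compares local luminance, contrast, and structure rather than raw per-pixel error. A single-window sketch of the formula (production implementations slide a Gaussian window across each frame, so treat this as illustrative only):

```python
import numpy as np

def global_ssim(x, y, peak=255.0):
    """Single-window SSIM sketch between two same-shaped images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants from the standard SSIM definition
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2)) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Identical images score 1.0 and any distortion pulls the score below it, but like PSNR this is still a frame-by-frame comparison with no notion of inter-frame compression.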

Getting a consistent PSNR measurement on a transcoded video requires that the measurement either be performed at the time of transcoding or that the transcoded file be decoded back to an uncompressed form (native color space) after the transcoding is complete.
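In practice, the decode-and-compare step is often delegated to a tool that does both in one pass; for example, FFmpeg's psnr filter decodes both files and reports per-frame and average PSNR. A command-line sketch (the filenames are placeholders, and FFmpeg is assumed to be available):

```shell
# Decode both the transcoded output and the original source, compute
# per-frame PSNR, and write the per-frame log; "-f null -" discards the
# decoded video, since only the measurement is wanted.
ffmpeg -i transcoded_output.mp4 -i original_source.mp4 \
       -lavfi "psnr=stats_file=psnr_per_frame.log" -f null -
```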

In our testing, we found that PSNR measurements vary from system to system. Measurements taken at the time of transcoding yielded fairly high marks on the system on which they were transcoded, but the marks were much lower when the files were later run through a PSNR measurement system on a single, baseline system.

We ended up using a baseline system to measure PSNR when we discovered that not all professional transcoding systems have the ability to generate PSNR measurements. Further examination also determined that some systems generated PSNR only during the transcoding process, while others allowed PSNR generation without requiring another round of transcoding.

Based on this information, we made the move to a single reference platform to generate PSNR measurements and chose a small subset of the 3,000-plus videos encoded on each tested platform to correlate against the subjective blind rankings performed by the “golden eyes” testers. 

There are a number of systems on the market for PSNR testing, ranging from free solutions provided by national labs to open source systems supported by universities to commercial tools costing thousands of dollars. One thing to keep in mind when using PSNR or SSIM is that many of the test tools aren't capable of handling high-definition source content, so your mileage may vary when it comes to choosing a test tool.

For the drill-down into subjective blind ranking tests, we chose a subset of files, each selected to cover a variety of pixel sizes, frame rates, and source file types.

The results of the subjective tests were interesting: For outputs designed for web or IPTV playback (1080p, 720p, standard definition, and CIF files), the subjective blind ranking failed to reach the same quality conclusions as the PSNR measurements a hefty 66% of the time. In only one-third of all web outputs did both subjective and objective quality tests choose the same file.

One could conjecture that consensus cannot be reached by any group of industry veterans, since divergence of opinion is a hallmark of subjective testing. But it is interesting to note that the testers were all fairly consistent in their ranking choices: Of the nine test sets, only one came close to dividing the subjective testers on their decisions; for the vast majority of subjective tests, the ranking of quality within the test sets was quite consistent.

For test sets that were focused on output files to be delivered to mobile devices, the PSNR and subjective ratings were a bit more consistent, which led to a logical theory: Perhaps PSNR is just not that accurate for high-definition content. This was bolstered by the fact that PSNR showed a tendency toward favoring particular systems—or even codecs—that don’t necessarily correlate to human perceptions of quality. 

In researching the outcomes, I came back to an additional insight by Winkler and Mohandas. 

“Despite its popularity, PSNR only has an approximate relationship with the video quality perceived by human observers,” they write, “simply because it is based on a byte-by-byte comparison of the data without considering what they actually represent. PSNR is completely ignorant to things as basic as pixels and their spatial relationship, or things as complex as the interpretation of images and image differences by the human visual system.”

To prove their point, the researchers show two images side by side in which one has a clear visual degradation and the other appears to have little visual degradation. Both images, they point out, have equal PSNR values.