A Perceptual Metric for Production Testing (Submitted and Accepted in Journal of Graphics Tools, 2004) 

Hector Yee

Abstract 

This paper describes a perceptually based image comparison process that can be used to tell when images are perceptually identical even though they contain imperceptible numerical differences. The technique has shown much utility in the production testing of rendering software. 

Introduction 

Rendering software in a production environment needs to be stable and bug free. Entire movie production pipelines depend on the software to function correctly without crashing. In order to ensure such stability, rigorous testing of the software is needed whenever the code changes. Testing requires frequent re-rendering of canonical test scenes to ensure that no bugs are introduced in the development process. The images from the test scenes are then compared with previously generated reference images to ensure that no significant changes are visible. A pixel-by-pixel comparison between the reference and test images does not always work, because a small change in code might result in tiny pixel-level misalignments or intensity changes that are not visible but nevertheless produce pixels that are not numerically identical. 

There are many sources of imperceptible changes. One major source of change comes from sampling. For example, if the anti-aliasing scheme were to change from one version of the renderer to the next, the pixels on the edges of objects might have slightly different values. The same thing might happen in the shadows of objects as soft shadow algorithms evolve over time. These imperceptible changes induce many false positives (reported bugs that are not actual bugs) in the rendering tests. Having a perceptually based error metric allows us to weed out many false positives in the rendering test suite without having to constantly update the reference images. 

Previous Work 

Daly proposed the Visible Differences Predictor (VDP) in [Daly93] for predicting the probability of detection of differences between two images. The Sarnoff Visual Discrimination Model [Lubi95] is another popular image metric. We will focus on the VDP as it is the closest to our model. The VDP gives the per-pixel probability of detection given two images to be compared and the viewing conditions under which the images are seen. Daly’s model takes into account three factors of the Human Visual System (HVS) that reduce the sensitivity to error. The first, amplitude non-linearity, notes that the sensitivity of the HVS to contrast changes decreases with increasing light levels. This means that humans are more likely to notice a change of a particular magnitude in low light conditions than the same change in a brighter environment. Note that the HVS is sensitive to relative rather than absolute changes in luminance. Secondly, the sensitivity to changes decreases with increasing spatial frequency. For example, a needle is harder to spot in a haystack than on a white piece of paper due to the higher spatial frequency content of the haystack. Finally, the last effect, masking, takes into account the variations in sensitivity due to the signal content of the background. 

We use an abridged version of the VDP in the same way as Ramasubramanian et al. [Rama99], in that we drop the orientation computation when calculating spatial frequencies. We also extend [Rama99] by including the color domain when computing the differences. By using an abridged VDP, we gain a speed increase over the full VDP, which is essential when testing thousands of images at film resolution. Also, a user-specified field of view lets us control the sensitivity of the error metric. For example, we might specify a front row theater viewer for a very conservative metric.  

Implementation 

Assuming that the reference image and the image to be compared are in the RGB color space, the first step is to convert the images into the XYZ and CIE-L*A*B color spaces. XYZ is a color space where Y represents the luminance of a pixel and X, Z are color coordinates. CIE-L*A*B is a color space designed to be perceptually uniform, where the Euclidean distance between two colors corresponds to perceptual distance. L also represents luminance and A, B are color coordinates. This conversion step is described in [Glas95]. 
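
For concreteness, a minimal sketch of this conversion step is given below. It assumes linear RGB with Rec. 709 primaries and a D65 white point; the exact matrix depends on the RGB space used by the renderer, and [Glas95] covers the general case.

#include <math.h>

/* D65 white point; an assumption, since the RGB space is not fixed here */
static const float Xn = 0.9505f, Yn = 1.0000f, Zn = 1.0891f;

static float lab_f(float t)
{
      // cube root with the linear toe of the CIE-L*A*B definition
      return (t > 0.008856f) ? cbrtf(t) : (7.787f * t + 16.0f / 116.0f);
}

void rgb_to_xyz_lab(float r, float g, float b,
                    float *X, float *Y, float *Z,
                    float *L, float *A, float *B)
{
      // linear Rec. 709 RGB to CIE XYZ (matrix is an assumption)
      *X = 0.4124f * r + 0.3576f * g + 0.1805f * b;
      *Y = 0.2126f * r + 0.7152f * g + 0.0722f * b;
      *Z = 0.0193f * r + 0.1192f * g + 0.9505f * b;

      // CIE XYZ to CIE-L*A*B relative to the white point
      float fx = lab_f(*X / Xn);
      float fy = lab_f(*Y / Yn);
      float fz = lab_f(*Z / Zn);
      *L = 116.0f * fy - 16.0f;
      *A = 500.0f * (fx - fy);
      *B = 200.0f * (fy - fz);
}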

The following steps compute the threshold elevation factor F, or how much the tolerance to error is increased, in a manner similar to that found in [Rama99]. A spatial frequency hierarchy is constructed from the Y channel of the reference image. This step is efficiently computed using the Laplacian pyramid of Burt and Adelson [Burt83]. The pyramid lets us estimate the spatial frequencies present in the image, which determine how the sensitivity to contrast decreases with increasing frequency. The pyramid is constructed by convolving the luminance Y channel with the separable filter w = [0.05, 0.25, 0.4, 0.25, 0.05]. G(0) = Y. G(n+1) = convolve G(n) with w horizontally and vertically. The contrast pyramid is computed by C(n) = |G(n) - G(n+1)| / G(n+2). We use at most n_max = log2(min(width, height)) levels in the pyramid. 
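
As an illustration, the pyramid construction described above might be implemented as in the following sketch. Keeping every level at full resolution, clamping the filter at the image borders, and guarding the division with a small epsilon are implementation choices assumed here rather than details taken from the text.

#include <math.h>
#include <stdlib.h>
#include <string.h>

static float get(const float *img, int w, int h, int x, int y)
{
      // clamp-to-edge addressing for the border pixels
      if (x < 0) x = 0;
      if (x >= w) x = w - 1;
      if (y < 0) y = 0;
      if (y >= h) y = h - 1;
      return img[y * w + x];
}

/* convolve src with w = [0.05, 0.25, 0.4, 0.25, 0.05] horizontally then vertically */
static void convolve5(const float *src, float *dst, int w, int h)
{
      static const float k[5] = { 0.05f, 0.25f, 0.4f, 0.25f, 0.05f };
      float *tmp = malloc(sizeof(float) * w * h);
      for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                  float s = 0.0f;
                  for (int i = -2; i <= 2; i++)
                        s += k[i + 2] * get(src, w, h, x + i, y);
                  tmp[y * w + x] = s;
            }
      for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                  float s = 0.0f;
                  for (int i = -2; i <= 2; i++)
                        s += k[i + 2] * get(tmp, w, h, x, y + i);
                  dst[y * w + x] = s;
            }
      free(tmp);
}

/* G and C are preallocated: n_max buffers for G, n_max - 2 for C, each w*h floats */
void build_contrast_pyramid(const float *Y, int w, int h,
                            float **G, float **C, int n_max)
{
      memcpy(G[0], Y, sizeof(float) * w * h);
      for (int n = 0; n + 1 < n_max; n++)
            convolve5(G[n], G[n + 1], w, h);       // G(n+1) = blur of G(n)
      for (int n = 0; n + 2 < n_max; n++)
            for (int p = 0; p < w * h; p++) {
                  float denom = G[n + 2][p] > 1e-5f ? G[n + 2][p] : 1e-5f;
                  C[n][p] = fabsf(G[n][p] - G[n + 1][p]) / denom;
            }
}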

Following [Rama99],  we compute the normalized Contrast Sensitivity Function (CSF) using the formula of [Bart89], multiplied by the masking function given in [Daly93] to obtain the combined threshold elevation factor, F. We compute some of the intermediate variables from the field of view (fov) and the image width with the following from [Ward97]: 

num_one_degree_pixels = 2 * tan(fov * 0.5) * 180 / PI;

pixels_per_degree = width / num_one_degree_pixels;

cycles_per_degree = 0.5 * pixels_per_degree; 

where fov is the horizontal field of view in radians and width is the number of pixels across the screen. The top level of the Laplacian pyramid, C(0), corresponds to frequencies at cycles_per_degree and each level thereafter is half the frequency of the preceding level. Typical values for fov and width are discussed in the next section. The CSF is computed as a function of cycles per degree (cpd) and luminance (lum): 

/*
 * Contrast Sensitivity Function
 * from Barten SPIE 1989
 */
float csf(float cpd, float lum)
{
      // computes the contrast sensitivity function
      // given the cycles per degree (cpd) and luminance (lum)
      float a, b, result;
      a = 440.0 * pow(1.0 + 0.7 / lum, -0.2);
      b = 0.3 * pow(1.0 + 100.0 / lum, 0.15);
      result = a * cpd * exp(-b * cpd) * sqrt(1.0 + 0.06 * exp(b * cpd));
      return result;
} 

We compute the constant csf_max = csf(3.248, 100.0) because the maximum of the CSF occurs around 3.2 cycles per degree at 100 cd/m^2. 

Then, the elevation factor due to frequency is: 

F_freq[i] = csf_max / csf( cpd[i] , 100) 

where cpd[0] = cycles_per_degree and cpd[i] = 0.5 * cpd[i - 1] 

and i is an integer from 0 to n_max - 1 
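
Putting the viewing parameters and the CSF together, the per-level frequency elevation factors might be computed as in the following sketch, which reuses the csf function given above; the function name and array layout are illustrative.

#include <math.h>

#define PI 3.14159265358979f

float csf(float cpd, float lum);   /* defined above */

/* cpd and F_freq are caller-supplied arrays of at least n_max entries */
void compute_freq_elevation(float fov, int width, int n_max,
                            float *cpd, float *F_freq)
{
      // viewing parameters, following [Ward97] as above
      float num_one_degree_pixels = 2.0f * tanf(fov * 0.5f) * 180.0f / PI;
      float pixels_per_degree = width / num_one_degree_pixels;
      float cycles_per_degree = 0.5f * pixels_per_degree;

      // peak of the CSF is near 3.248 cycles per degree at 100 cd/m^2
      float csf_max = csf(3.248f, 100.0f);

      cpd[0] = cycles_per_degree;
      for (int i = 1; i < n_max; i++)
            cpd[i] = 0.5f * cpd[i - 1];
      for (int i = 0; i < n_max; i++)
            F_freq[i] = csf_max / csf(cpd[i], 100.0f);
}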

 

The masking function from [Daly93] is a function of contrast:

/*
 * Visual Masking Function
 * from Daly 1993
 */
float mask(float contrast)
{
      float a, b, result;
      a = pow(392.498 * contrast, 0.7);
      b = pow(0.0153 * a, 4);
      result = pow(1 + b, 0.25);
      return result;
} 

Then the elevation factor due to masking is: 

F_mask[i] = mask(C[i] * csf(cpd[i], Y_adapt) ) where 

Y_adapt is the average of pixels in the Y channel in a pixel area of width pixels_per_degree. 

The threshold elevation factor F is computed as: 

F = ( sum over i = 0 to n_max of C[i] * F_freq[i] * F_mask[i] ) / ( sum over i = 0 to n_max of C[i] ) 
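
A sketch of how the frequency and masking terms might be combined into the per-pixel threshold elevation factor is shown below; the helper names, the flat pixel indexing, and the guard for nearly flat regions are illustrative choices rather than details from the text.

#include <math.h>

float csf(float cpd, float lum);    /* defined above */
float mask(float contrast);         /* defined above */

/* C[i] is contrast pyramid level i as a flat array; pixel indexes into it */
float threshold_elevation(float **C, const float *cpd, const float *F_freq,
                          int n_max, int pixel, float y_adapt)
{
      float numerator = 0.0f, denominator = 0.0f;
      for (int i = 0; i + 2 < n_max; i++) {   // levels where the contrast is defined
            float F_mask = mask(C[i][pixel] * csf(cpd[i], y_adapt));
            numerator   += C[i][pixel] * F_freq[i] * F_mask;
            denominator += C[i][pixel];
      }
      if (denominator < 1e-5f)
            denominator = 1e-5f;   // guard for nearly flat regions (assumption)
      return numerator / denominator;
}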

Finally, we perform the following two tests and mark the images as different if either of the two tests fails. The first test is performed on the luminance channel, Y. If the absolute difference in luminance between two corresponding pixels (x,y) in the reference and test images is deltaY(x,y) = |Y1(x,y) - Y2(x,y)|, then the luminance test fails if: 

deltaY(x,y) > F * TVI( Y_adapt ) 

where TVI is the Threshold vs Intensity function found in [Ward97] and the adaptation luminance, Y_adapt, is the average of pixels in a one degree radius from the Y channel of the reference image. The TVI is computed as follows: 

/*
 * Given the adaptation luminance, this function returns the
 * threshold of visibility in cd per m^2
 * TVI means Threshold vs Intensity function
 * This version comes from Ward Larson Siggraph 1997
 */
float TVI(float adaptation_luminance)
{
      // returns the threshold luminance given the adaptation luminance
      // units are candelas per meter squared
      float log_a, r, result;
      log_a = log10(adaptation_luminance);
      if (log_a < -3.94) {
            r = -2.86;
      } else if (log_a < -1.44) {
            r = pow(0.405 * log_a + 1.6, 2.18) - 2.86;
      } else if (log_a < -0.0184) {
            r = log_a - 0.395;
      } else if (log_a < 1.9) {
            r = pow(0.249 * log_a + 0.65, 2.7) - 0.72;
      } else {
            r = log_a - 1.255;
      }
      result = pow(10.0, r);
      return result;
} 

The second test is performed on the A and B channels of the reference and test images. The color test fails if: 

( (A_ref(x,y) - A_test(x,y))^2 + (B_ref(x,y) - B_test(x,y))^2 ) * color_scale^2 > F 

color_scale is a scale factor that turns off the color test in the mesopic and scotopic luminance ranges (nighttime light levels) where color vision starts to degrade. We use a value of one for adaptation luminance greater than 10.0 cd/m^2. We then ramp color_scale linearly to zero with decreasing adaptation luminance. 
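
Putting the two tests together, the per-pixel decision might be sketched as follows. The linear ramp that takes color_scale to zero at 0 cd/m^2 is one reading of the description above and is an assumption rather than a calibrated value; the function names mirror the earlier sketches.

#include <math.h>

float TVI(float adaptation_luminance);   /* defined above */

/* returns 1 if the pixel is perceptibly different, 0 otherwise */
int pixels_differ(float y_ref, float y_test,
                  float a_ref, float a_test,
                  float b_ref, float b_test,
                  float F, float y_adapt)
{
      // luminance test
      float delta_y = fabsf(y_ref - y_test);
      if (delta_y > F * TVI(y_adapt))
            return 1;

      // color test, faded out at mesopic and scotopic light levels
      float color_scale = 1.0f;
      if (y_adapt < 10.0f)
            color_scale = y_adapt / 10.0f;   // linear ramp to zero (assumption)
      if (color_scale < 0.0f)
            color_scale = 0.0f;
      float da = a_ref - a_test;
      float db = b_ref - b_test;
      if ((da * da + db * db) * color_scale * color_scale > F)
            return 1;

      return 0;
}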

Implementation Details 

There were some implementation details that had to be taken into account when using the perceptual error metric for Quality Assurance testing of a production renderer. 

First of all, the threshold elevation factor, F, depends strongly on the frequency content of the image. This in turn is affected by the viewing parameters of the observer, the most important of which is the field of view. We measured a few cinemas in Hollywood and found that the average front row and back row fields of view were 85 degrees and 27 degrees, respectively. Using a field of view of 85 degrees is the most conservative setting and increases the probability that the simulated front row observer will not notice differences between the reference and test images. Another important factor is the width of the image in pixels. We use a value of 1827 for film resolution images. The color_scale factor was added because the perceptual metric was returning false positives in very dark areas where the hue does not matter. It uses no rigorous perceptual data other than the fact that the HVS loses its color sensitivity in the mesopic and scotopic ranges. 
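
As a quick illustration of how these viewing parameters feed into the formulas given earlier, the following standalone snippet computes the pixels per degree for the measured front row and back row fields of view at film resolution.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979f

int main(void)
{
      // front row (85 degrees) and back row (27 degrees), film width of 1827 pixels
      float fovs_deg[2] = { 85.0f, 27.0f };
      int width = 1827;
      for (int i = 0; i < 2; i++) {
            float fov = fovs_deg[i] * PI / 180.0f;
            float num_one_degree_pixels = 2.0f * tanf(fov * 0.5f) * 180.0f / PI;
            float pixels_per_degree = width / num_one_degree_pixels;
            printf("fov %.0f deg: %.1f pixels per degree\n", fovs_deg[i], pixels_per_degree);
      }
      return 0;
}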

Application of Perceptual Metric in Production Testing 

Our studio pipeline is based on a proprietary renderer and a suite of Unix command line utilities. There are also many shaders written in C or C++ which are loaded as dynamic shared objects (DSOs) at run time by the renderer. In order to ensure that code changes do not change the look of previously rendered scenes, the renderer, utilities and shaders are tested by an automated batch process every night. An example of a single test would be to render a simple scene, such as a teapot, under different lighting conditions. One such test might exercise, for example, the spot light shader and a noise shader bound to the material of the teapot. A reference frame would be rendered and stored as a canonical image. Then, whenever the code changes, the new version of the renderer would be used to render the scene and the resulting image compared against the canonical. This allows us to catch inadvertent bugs introduced into the code. For example, suppose we change the spot light shader so that shadows are calculated with a faster, optimized technique. This would produce images that are not pixel identical to the canonical but are nevertheless correct. The perceptual metric allows us to make such code changes without breaking the automated testing process. 

The batch testing suite produces a mosaic of difference images for every test that fails. This mosaic is simply the concatenation of images that differ from the stored canonical. Using this image, the QA group is able to quickly narrow down problems with the code. Prior to the development of the perceptual metric, so many tests would fail that isolating the cause of failures was almost impossible. By quickly screening out false positives, only visually perceptible differences make it to the mosaic, speeding up problem identification. 

During the production of Shrek 2, a new compiler was introduced in order to better optimize the renderer. The application of the perceptual metric and automated test suite helped us isolate the few shaders that were broken, perhaps due to numerical instability or failures in compilation. It was essential in helping production switch over to the new software with the assurance that nothing in their existing shots would break. 

When the studio recently switched over to a new operating system, the test suite helped us isolate command line tools that stopped working under the new OS or started behaving differently on the new platform. 

Acknowledgments 

Thanks to Bill Seneschen for the Hollywood cinema measurements and Paul Rademacher for proofreading. Thanks also to the anonymous reviewers for their valuable feedback. 

Bibliography 

[Bart89] Peter G. J. Barten. The Square Root Integral (SQRI): A new metric to describe the effect of various display parameters on perceived image quality. In Human Vision, Visual Processing, and Digital Display, volume 1077, pages 73-82. Proc. of SPIE, 1989. 

[Burt83] Peter Burt and Edward Adelson. The Laplacian pyramid as a compact image code. In IEEE Transactions on Communications, Vol. Com-31, No. 4, April 1983. 

[Daly93] Scott Daly. The visible differences predictor: an algorithm for the assessment of image fidelity. In Digital Images and Human Vision, pages 179-206, MIT Press, Cambridge, MA, 1993. 

[Glas95] Andrew S. Glassner. Principles of Digital Image Synthesis, pages 59-66. Morgan Kaufmann Publishers, San Francisco, CA, 1995. 

[Lubi95] Jeffrey Lubin. A visual discrimination model for imaging system design and evaluation. In Vision Models for Target Detection and Recognition, pages 245-283. World Scientific, New Jersey, 1995. 

[Rama99] Mahesh Ramasubramanian, Sumanta N. Pattanaik, and Donald P. Greenberg. A perceptually based physical error metric for realistic image synthesis. In SIGGRAPH 99 Conference Proceedings, pages 73-82, Los Angeles, CA, 1999. 

[Ward97] Gregory Ward Larson, Holly Rushmeier, and Christine Piatko. A visibility matching tone reproduction operator for high dynamic range scenes. In IEEE Transactions on Visualization and Computer Graphics, 3(4):291-306, October 1997.