A Perceptual Metric for Production Testing
(Submitted and accepted in Journal of Graphics Tools, 2004)
Hector Yee
Abstract 
This paper describes a perceptually 
based image comparison process that can be used to tell when images 
are perceptually identical even though they contain imperceptible numerical 
differences. The technique has shown much utility in the production 
testing of rendering software. 
Introduction 
Rendering software in a production 
environment needs to be stable and bug free. Entire movie production 
pipelines depend on the software to function correctly without crashing. 
In order to ensure such stability, rigorous testing of the software 
is needed whenever the code changes. Testing requires frequent re-rendering 
of canonical test scenes to ensure that no bugs are introduced in the 
development process. The images from the test scenes are then compared 
with previously generated reference images to ensure that no significant 
changes are visible. A pixel-by-pixel comparison between the reference and
test images does not always work, because a small change in code might
result in tiny pixel-level misalignments or intensity changes that are
not visible but nevertheless produce pixels that are not numerically
identical.
There are many sources of imperceptible 
changes. One major source of change comes from sampling. For example, 
if the anti-aliasing scheme were to change from one version of the renderer 
to the next, the pixels on the edges of objects might have slightly 
different values. The same thing might happen in the shadows of objects 
as soft shadow algorithms evolve over time. These imperceptible changes 
induce many false positives (reported failures that are not actual bugs) in the
rendering tests. Having a perceptually based error metric allows us 
to weed out many false positives in the rendering test suite without 
having to constantly update the reference images. 
Previous Work 
Daly proposed the Visible Differences 
Predictor (VDP) in [Daly93] for predicting the probability of detection 
of differences between two images. The Sarnoff Visual Discrimination 
Model [Lubi95] is another popular image metric. We will focus on the 
VDP as it is the closest to our model. The VDP gives the per-pixel probability 
of detection given two images to be compared and a set of viewing conditions 
the images are seen under. Daly’s model takes into account three factors 
of the Human Visual System (HVS) that reduce the sensitivity to error. 
The first, amplitude non-linearity, notes that the sensitivity of the 
HVS to contrast changes decreases with increasing light levels. This 
means that humans are more likely to notice a change of a particular 
magnitude at low light conditions than the same change in a brighter 
environment. Note that the HVS is sensitive to relative rather than 
absolute changes in luminance. Secondly, the sensitivity to changes 
decreases with increasing spatial frequency. For example, a needle is 
harder to spot in a haystack than on a white piece of paper due to the 
higher spatial frequency of the haystack. Finally, the last effect, 
masking, takes into account the variations in sensitivity due to the 
signal content of the background. 
We use an abridged version 
of the VDP in the same way as Ramasubramanian et al. [Rama99], in that
we drop the orientation computation when calculating spatial frequencies. 
We also extend [Rama99] by including the color domain in computing the 
differences. By using an abridged VDP, we gain some speed increase over 
the full VDP, which is essential when testing thousands of images at 
film resolution. Also, a user specified field of view lets us control 
the sensitivity of the error metric. For example, we might specify a 
front row theater viewer for a very conservative metric.  
Implementation 
Assuming that the reference
image and the image to be compared are in the RGB color space,
the first step is to convert the images into XYZ and CIE-L*A*B
space. XYZ is a color space where Y represents the luminance of a pixel
and X, Z are color coordinates. CIE-L*A*B is a color space designed
to be perceptually uniform, where the Euclidean distance between two
colors corresponds to perceptual distance. L also represents luminance
and A, B are color coordinates. This conversion step is described in
[Glas95].
The following steps compute 
the threshold elevation factor F, or how much tolerance to error is 
increased, similar to that found in [Rama99]. A spatial frequency hierarchy 
is constructed from the Y channel of the reference image. This step 
is efficiently computed using the Laplacian pyramid of Burt and Adelson 
[Burt83]. The pyramid enables us to compute the spatial frequencies 
present in the image to determine how sensitivity to contrast changes 
decreases with increasing frequency. The pyramid is constructed by convolving
the luminance Y channel with the separable filter
w = [0.05, 0.25, 0.4, 0.25, 0.05]:
G(0) = Y
G(n+1) = G(n) convolved with w horizontally and vertically
The contrast pyramid is then computed as:
C(n) = |G(n) - G(n+1)| / G(n+2)
We use at most n_max = log2(min(width, height)) levels in
the pyramid.
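One pyramid step and the per-pixel contrast can be sketched as below. The text does not say how image borders are handled or whether levels are downsampled, so this sketch assumes clamped borders and keeps every level at full resolution; blur_level and contrast_at are hypothetical names, and the small floor on the denominator is an added guard against division by zero, not something the paper specifies.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* The separable filter w from the text. */
static const float W[5] = {0.05f, 0.25f, 0.4f, 0.25f, 0.05f};

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One pyramid step: dst = src convolved with W horizontally, then vertically.
 * Borders are clamped (an assumption). Returns 0 on success, -1 if the
 * temporary buffer cannot be allocated. */
int blur_level(const float *src, float *dst, int width, int height)
{
    assert(src && dst && width > 0 && height > 0);
    float *tmp = malloc((size_t)width * (size_t)height * sizeof *tmp);
    if (!tmp)
        return -1;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int k = -2; k <= 2; k++)
                s += W[k + 2] * src[y * width + clampi(x + k, 0, width - 1)];
            tmp[y * width + x] = s;
        }
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int k = -2; k <= 2; k++)
                s += W[k + 2] * tmp[clampi(y + k, 0, height - 1) * width + x];
            dst[y * width + x] = s;
        }
    free(tmp);
    return 0;
}

/* Contrast at one pixel: C(n) = |G(n) - G(n+1)| / G(n+2). */
float contrast_at(float g_n, float g_n1, float g_n2)
{
    const float tiny = 1e-5f; /* floor on the denominator (assumption) */
    float denom = fabsf(g_n2) > tiny ? fabsf(g_n2) : tiny;
    return fabsf(g_n - g_n1) / denom;
}
```

Because the filter weights sum to one, a constant image is left unchanged by blur_level, which is a convenient sanity check.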
Following [Rama99], we
compute the normalized Contrast Sensitivity Function (CSF) using the 
formula of [Bart89], multiplied by the masking function given in [Daly93] 
to obtain the combined threshold elevation factor, F. We compute some 
of the intermediate variables from the field of view (fov) and the image 
width with the following from [Ward97]: 
num_one_degree_pixels = 2 * tan(fov * 0.5) * 180 / PI;
pixels_per_degree = width / num_one_degree_pixels;
cycles_per_degree = 0.5 * pixels_per_degree; 
where fov is the horizontal
field of view in radians and width is the number of pixels across the screen.
The top level of the Laplacian pyramid, C(0), corresponds to frequencies 
at cycles_per_degree and each level thereafter is half the frequency 
of the preceding level. Typical values for fov and width are discussed 
in the next section. The CSF is computed as a function of cycles per 
degree (cpd) and luminance (lum): 
/*
 * Contrast Sensitivity Function
 * from Barten SPIE 1989
 */
#include <math.h>

float csf(float cpd, float lum)
{
    // computes the contrast sensitivity function
    // given the cycles per degree (cpd) and luminance (lum)
    float a, b, result;
    a = 440.0f * powf(1.0f + 0.7f / lum, -0.2f);
    b = 0.3f * powf(1.0f + 100.0f / lum, 0.15f);
    result = a * cpd * expf(-b * cpd) * sqrtf(1.0f + 0.06f * expf(b * cpd));
    return result;
}
We compute the constant csf_max
= csf(3.248, 100.0) because the maximum of the CSF occurs around 3.2
cycles per degree at 100 cd/m^2.
Then, the elevation factor 
due to frequency is: 
F_freq[i] = csf_max / csf(cpd[i], 100)
where cpd[0] = cycles_per_degree, cpd[i] = 0.5 * cpd[i - 1],
and i is an integer from 0 to n_max - 1.
 
The masking function from [Daly93] is a function of contrast:
/*
 * Visual Masking Function
 * from Daly 1993
 */
float mask(float contrast)
{
    // uses powf from <math.h>
    float a, b, result;
    a = powf(392.498f * contrast, 0.7f);
    b = powf(0.0153f * a, 4.0f);
    result = powf(1.0f + b, 0.25f);
    return result;
}
Then the elevation factor due
to masking is:
F_mask[i] = mask(C[i] * csf(cpd[i], Y_adapt))
where Y_adapt is the average of pixels
in the Y channel in a pixel area of width pixels_per_degree.
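The threshold elevation factor
F is the contrast-weighted average of the per-level factors:
F = (sum over i of C[i] * F_freq[i] * F_mask[i]) / (sum over i of C[i])
where both sums run over the pyramid levels i = 0 to n_max - 1.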
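For a single pixel, the weighted average that defines F can be sketched as below. The function name threshold_elevation is hypothetical, the per-level values are assumed to have been computed already, and returning 1.0 (no elevation) when the total contrast is essentially zero is an assumption made here to avoid dividing by zero in flat regions.

```c
#include <assert.h>

/* Contrast-weighted average of F_freq[i] * F_mask[i] over the pyramid levels.
 * contrast, f_freq and f_mask each hold n_levels values for one pixel. */
float threshold_elevation(const float *contrast, const float *f_freq,
                          const float *f_mask, int n_levels)
{
    assert(contrast && f_freq && f_mask && n_levels > 0);
    float weighted = 0.0f;
    float total = 0.0f;
    for (int i = 0; i < n_levels; i++) {
        weighted += contrast[i] * f_freq[i] * f_mask[i];
        total += contrast[i];
    }
    if (total < 1e-5f)
        return 1.0f; /* flat region: assume no threshold elevation */
    return weighted / total;
}
```

For example, with contrasts {1, 3}, F_freq = {2, 1} and F_mask = {1, 2}, the result is (1*2*1 + 3*1*2) / (1 + 3) = 2.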
Finally, we perform the following
two tests and mark the images as different if either test fails.
The first test is performed on the luminance channel, Y.
If the absolute difference of luminance between two corresponding pixels (x,y)
in the reference and test images is deltaY(x,y) = |Y1(x,y) - Y2(x,y)|,
then the luminance test fails if:
deltaY(x,y) > F * TVI(Y_adapt)
where TVI is the Threshold
vs Intensity function found in [Ward97] and the adaptation luminance,
Y_adapt, is the average of pixels in a one degree radius from
the Y channel of the reference image. The TVI is computed as follows:
/*
 * Given the adaptation luminance, this function returns the
 * threshold of visibility in cd per m^2
 * TVI means Threshold vs Intensity function
 * This version comes from Ward Larson Siggraph 1997
 */
float TVI(float adaptation_luminance)
{
    // returns the threshold luminance given the adaptation luminance
    // units are candelas per meter squared
    // uses log10f and powf from <math.h>
    float log_a, r, result;
    log_a = log10f(adaptation_luminance);
    if (log_a < -3.94f) {
        r = -2.86f;
    } else if (log_a < -1.44f) {
        r = powf(0.405f * log_a + 1.6f, 2.18f) - 2.86f;
    } else if (log_a < -0.0184f) {
        r = log_a - 0.395f;
    } else if (log_a < 1.9f) {
        r = powf(0.249f * log_a + 0.65f, 2.7f) - 0.72f;
    } else {
        r = log_a - 1.255f;
    }
    result = powf(10.0f, r);
    return result;
}
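The per-pixel luminance test can be sketched as follows. Two simplifications are assumed: the adaptation luminance is averaged over a square window of half-width radius pixels (clipped at the image borders) rather than a true one degree disc, and the visibility threshold TVI(Y_adapt) is passed in as a precomputed value. The names adaptation_luminance and luminance_test_fails are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Mean of the reference Y channel in a square window centred on (cx, cy).
 * The window is clipped at the image borders (an assumption). */
float adaptation_luminance(const float *Y, int width, int height,
                           int cx, int cy, int radius)
{
    assert(Y && width > 0 && height > 0 && radius >= 0);
    assert(cx >= 0 && cx < width && cy >= 0 && cy < height);
    float sum = 0.0f;
    int count = 0;
    for (int y = cy - radius; y <= cy + radius; y++) {
        if (y < 0 || y >= height)
            continue;
        for (int x = cx - radius; x <= cx + radius; x++) {
            if (x < 0 || x >= width)
                continue;
            sum += Y[y * width + x];
            count++;
        }
    }
    return sum / (float)count; /* count >= 1: the centre is inside the image */
}

/* Returns 1 if the luminance difference exceeds the elevated threshold.
 * tvi_threshold is the TVI value at the adaptation luminance. */
int luminance_test_fails(float y_ref, float y_test, float F, float tvi_threshold)
{
    return fabsf(y_ref - y_test) > F * tvi_threshold;
}
```

A radius of roughly pixels_per_degree would approximate the one degree neighbourhood described in the text.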
The second test is performed 
on the A and B channels of the reference and test images. The color 
test fails if: 
((A_ref(x,y) - A_test(x,y))^2 + (B_ref(x,y) - B_test(x,y))^2) * color_scale^2 > F
color_scale is a scale factor
that turns off the color test in the mesopic and scotopic luminance
ranges (night time light levels) where color vision starts to degrade.
We use a value of one for adaptation luminance greater than 10.0
cd/m^2. We then ramp color_scale linearly
to zero with decreasing adaptation luminance.
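A sketch of the color test follows. It assumes that the linear ramp reaches zero at an adaptation luminance of 0 cd/m^2, which the text does not state, and that color_scale multiplies the whole squared chroma distance so that it can switch the test off entirely. The function name color_test_fails is hypothetical.

```c
#include <assert.h>

/* 1.0 at or above 10 cd/m^2, ramping linearly down to 0.0 at 0 cd/m^2
 * (the lower end of the ramp is an assumption). */
float color_scale(float y_adapt)
{
    if (y_adapt >= 10.0f)
        return 1.0f;
    if (y_adapt <= 0.0f)
        return 0.0f;
    return y_adapt / 10.0f;
}

/* Returns 1 if the scaled squared distance in the A, B plane exceeds F. */
int color_test_fails(float a_ref, float b_ref, float a_test, float b_test,
                     float y_adapt, float F)
{
    assert(F >= 0.0f);
    float s = color_scale(y_adapt);
    float da = a_ref - a_test;
    float db = b_ref - b_test;
    return (da * da + db * db) * s * s > F;
}
```

With a chroma difference of (3, 4) and F = 10, the test fails in bright surroundings (25 > 10) but passes at an adaptation luminance of 5 cd/m^2, where the scaled distance drops to 6.25.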
Implementation Details 
Some implementation details
had to be taken into account when using the perceptual error metric for
quality assurance (QA) testing of a production renderer.
First of all, the threshold 
elevation factor, F, depends strongly on the frequency content of the 
image. This in turn is affected by the viewing parameters of the observer, 
the most important of which is the field of view. We measured a few
cinemas in Hollywood and found that the average front row and back
row fields of view were 85 degrees and 27 degrees respectively. Using
a field of view of 85 degrees is the most conservative and will increase 
the probability that the simulated front row observer will not notice 
differences between the reference and test images. Another important 
factor is the width of the image in pixels. We use a value of 1827 for 
film resolution images. The color_scale factor was added because the 
perceptual metric was returning false positives in very dark areas where 
the hue does not matter. It uses no rigorous perceptual data other than 
the fact that the HVS loses its color sensitivity in the mesopic and 
scotopic ranges. 
Application of Perceptual 
Metric in Production Testing 
Our studio pipeline is based 
on a proprietary renderer and a suite of Unix command line utilities. 
There are also many shaders written in C or C++ which are loaded as 
dynamic shared objects (DSOs) at run time by the renderer. In order 
to ensure that code changes do not change the look of previously rendered 
scenes, the renderer, utilities and shaders are tested by an automated 
batch process every night. An example of a single test would render 
a simple scene, such as a teapot, under different lighting conditions. 
One such test would test, for example, the spot light shader and a noise 
shader bound to the material of the teapot. A reference frame would 
be rendered and stored as a canonical image. Then, when any code changes, 
the new version of the renderer would be used to render the scene and the
resulting image compared against the canonical. This allows us to catch 
inadvertent bugs introduced into the code. For example, suppose we change 
the spot light shader so that shadows are calculated with a faster, 
optimized technique. This would produce images that are not pixel identical 
to the canonical but nevertheless correct. The perceptual metric allows 
us to make such code changes without breaking the automated testing 
process. 
The batch testing suite produces 
a mosaic of difference images for every test that fails. This mosaic 
is simply the concatenation of images that differ from the stored canonical. 
Using this image, the QA group is able to quickly narrow down problems 
with the code. Prior to the development of the perceptual metric, so
many tests would fail that isolating the cause of failures was almost
impossible. By quickly screening out false positives, only visually
perceptible differences make it to the mosaic, speeding up problem identification. 
During the production of Shrek 
2, a new compiler was introduced in order to better optimize the renderer. 
The application of the perceptual metric and automated test suite helped 
us isolate the few shaders that were broken, perhaps due to numerical 
instability or failures in compilation. It was essential in helping 
production switch over to the new software with the assurance that nothing 
in their existing shots would break. 
When the studio recently switched 
over to a new operating system, the test suite helped us isolate command 
line tools that stopped working under the new OS or started behaving 
differently on the new platform. 
Acknowledgments 
Thanks to Bill Seneschen for 
the Hollywood Cinema measurements and Paul Rademacher for proof reading. 
Thanks also to the anonymous reviewers for their valuable feedback. 
Bibliography 
[Bart89] Peter G. J. Barten.
The Square Root Integral (SQRI): A new metric to describe the effect
of various display parameters on perceived image quality. In Human
Vision, Visual Processing, and Digital Display, volume 1077, pages
73-82. Proc. of SPIE, 1989.
[Burt83] Peter Burt and Edward
Adelson. The Laplacian pyramid as a compact image code. In IEEE Transactions
on Communications, Vol. COM-31, No. 4, April 1983.
[Daly93] Scott Daly. The visible
differences predictor: an algorithm for the assessment of image fidelity.
In Digital Images and Human Vision, pages 179-206, MIT Press,
Cambridge, MA, 1993.
[Glas95] Andrew S. Glassner.
Principles of Digital Image Synthesis, pages 59-66, Morgan
Kaufmann Publishers, San Francisco, CA, 1995.
[Lubi95] Jeffrey Lubin. A
visual discrimination model for imaging system design and evaluation.
In Vision Models for Target Detection and Recognition, pages
245-283. World Scientific, New Jersey, 1995.
[Rama99] Mahesh Ramasubramanian,
Sumanta N. Pattanaik, Donald P. Greenberg. A perceptually based physical
error metric for realistic image synthesis. In SIGGRAPH 99 Conference
Proceedings, pages 73-82, Los Angeles, CA, 1999.
[Ward97] Gregory Ward-Larson, Holly Rushmeier, and Christine Piatko.
A visibility matching tone reproduction operator for high dynamic range
scenes. In IEEE Transactions on Visualization and Computer Graphics,
3(4):291-306, October 1997.