A Perceptual
Metric for Production Testing (Submitted and Accepted in Journal of Graphics Tools, 2004)
Hector Yee
Abstract
This paper describes a perceptually
based image comparison process that can be used to tell when images
are perceptually identical even though they contain imperceptible numerical
differences. The technique has shown much utility in the production
testing of rendering software.
Introduction
Rendering software in a production
environment needs to be stable and bug free. Entire movie production
pipelines depend on the software to function correctly without crashing.
In order to ensure such stability, rigorous testing of the software
is needed whenever the code changes. Testing requires frequent re-rendering
of canonical test scenes to ensure that no bugs are introduced in the
development process. The images from the test scenes are then compared
with previously generated reference images to ensure that no significant
changes are visible. A pixel-by-pixel comparison between the reference
and test images does not always work because a small change in code
might result in tiny pixel-level misalignments or intensity changes
that are not visible but nevertheless produce pixels that are not
numerically identical.
There are many sources of imperceptible
changes. One major source of change comes from sampling. For example,
if the anti-aliasing scheme were to change from one version of the renderer
to the next, the pixels on the edges of objects might have slightly
different values. The same thing might happen in the shadows of objects
as soft shadow algorithms evolve over time. These imperceptible changes
induce many false positives (reported failures that are not actual bugs) in the
rendering tests. Having a perceptually based error metric allows us
to weed out many false positives in the rendering test suite without
having to constantly update the reference images.
Previous Work
Daly proposed the Visible Differences
Predictor (VDP) in [Daly93] for predicting the probability of detection
of differences between two images. The Sarnoff Visual Discrimination
Model [Lubi95] is another popular image metric. We will focus on the
VDP as it is the closest to our model. The VDP gives the per-pixel probability
of detection given two images to be compared and a set of viewing conditions
the images are seen under. Daly’s model takes into account three factors
of the Human Visual System (HVS) that reduce the sensitivity to error.
The first, amplitude non-linearity, notes that the sensitivity of the
HVS to contrast changes decreases with increasing light levels. This
means that humans are more likely to notice a change of a particular
magnitude at low light conditions than the same change in a brighter
environment. Note that the HVS is sensitive to relative rather than
absolute changes in luminance. Secondly, the sensitivity to changes
decreases with increasing spatial frequency. For example, a needle is
harder to spot in a haystack than on a white piece of paper due to the
higher spatial frequency of the haystack. Finally, the last effect,
masking, takes into account the variations in sensitivity due to the
signal content of the background.
We use an abridged version
of the VDP in the same way as Ramasubramanian et al. [Rama99], in that
we drop the orientation computation when calculating spatial frequencies.
We also extend [Rama99] by including the color domain in computing the
differences. By using an abridged VDP, we gain some speed increase over
the full VDP, which is essential when testing thousands of images at
film resolution. Also, a user-specified field of view lets us control
the sensitivity of the error metric. For example, we might specify a
front row theater viewer for a very conservative metric.
Implementation
Assuming that the reference
image and the image to be compared are in the RGB color space,
the first step will be to convert the images into XYZ and CIE-L*A*B
space. XYZ is a color space where Y represents the luminance of a pixel
and X, Z are color coordinates. CIE-L*A*B is a color space designed
to be perceptually uniform, where the Euclidean distance between two
colors corresponds to perceptual distance. L also represents luminance
and A, B are color coordinates. This conversion step is described in
[Glas95].
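As an illustration, the sketch below shows one way this conversion could be written, assuming linear RGB with sRGB/Rec. 709 primaries and a D65 reference white; the matrix, white point, and function names are assumptions for the example and should be replaced with those of the renderer's actual working color space. In practice the Y channel would also need to be scaled to absolute luminance in cd/m^2 for the intended display, since the threshold functions used later expect physical units.

/*
 * Sketch of the color conversion step.  Assumes linear RGB with
 * sRGB/Rec. 709 primaries and a D65 reference white; substitute the
 * matrix and white point of the renderer's working color space.
 */
#include <math.h>

static float lab_f(float t)
{
    /* CIE-L*A*B helper: cube root above a small threshold, linear below */
    return (t > 0.008856f) ? cbrtf(t) : 7.787f * t + 16.0f / 116.0f;
}

void rgb_to_xyz_lab(float r, float g, float b,
                    float *x, float *y, float *z,
                    float *L, float *A, float *B)
{
    /* linear RGB -> XYZ */
    *x = 0.4124f * r + 0.3576f * g + 0.1805f * b;
    *y = 0.2126f * r + 0.7152f * g + 0.0722f * b;
    *z = 0.0193f * r + 0.1192f * g + 0.9505f * b;

    /* XYZ -> CIE-L*A*B, normalized by the D65 reference white */
    const float xn = 0.9505f, yn = 1.0f, zn = 1.089f;
    float fx = lab_f(*x / xn);
    float fy = lab_f(*y / yn);
    float fz = lab_f(*z / zn);
    *L = 116.0f * fy - 16.0f;
    *A = 500.0f * (fx - fy);
    *B = 200.0f * (fy - fz);
}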
The following steps compute
the threshold elevation factor F, or how much tolerance to error is
increased, similar to that found in [Rama99]. A spatial frequency hierarchy
is constructed from the Y channel of the reference image. This step
is efficiently computed using the Laplacian pyramid of Burt and Adelson
[Burt83]. The pyramid enables us to compute the spatial frequencies
present in the image to determine how sensitivity to contrast changes
decreases with increasing frequency. The pyramid is constructed by convolving
the luminance Y channel with the separable filter w = [0.05 , 0.25,
0.4, 0.25, 0.05]. G(0) = Y. G(n+1) = convolve G(n) with w horizontally
and vertically. The contrast pyramid is computed by C(n) = |G(n) - G(n+1)|
/ G(n+2). We use at most n_max = log2(min(width, height)) levels in
the pyramid.
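As a sketch of this step, the routine below computes one pyramid level by the separable convolution described above, and the per-pixel contrast from three successive levels; it assumes images are stored as width * height float arrays in row-major order, borders are handled by clamping, and the small epsilon guarding the division is an implementation assumption rather than part of the method.

/*
 * Sketch of the pyramid construction.  G(n+1) is obtained from G(n) by
 * convolving with the separable filter w horizontally and vertically,
 * clamping indices at the image borders.
 */
#include <math.h>

static const float w[5] = { 0.05f, 0.25f, 0.4f, 0.25f, 0.05f };

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* dst = convolve(src, w); src, dst and tmp are width*height luminance arrays */
void next_pyramid_level(const float *src, float *dst, float *tmp,
                        int width, int height)
{
    for (int y = 0; y < height; y++)          /* horizontal pass */
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int k = -2; k <= 2; k++)
                s += w[k + 2] * src[y * width + clampi(x + k, 0, width - 1)];
            tmp[y * width + x] = s;
        }
    for (int y = 0; y < height; y++)          /* vertical pass */
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int k = -2; k <= 2; k++)
                s += w[k + 2] * tmp[clampi(y + k, 0, height - 1) * width + x];
            dst[y * width + x] = s;
        }
}

/* per-pixel contrast at level n: C(n) = |G(n) - G(n+1)| / G(n+2) */
float contrast(float g_n, float g_n1, float g_n2)
{
    /* the epsilon guarding the division is an assumption */
    return fabsf(g_n - g_n1) / (g_n2 > 1e-5f ? g_n2 : 1e-5f);
}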
Following [Rama99], we
compute the normalized Contrast Sensitivity Function (CSF) using the
formula of [Bart89], multiplied by the masking function given in [Daly93]
to obtain the combined threshold elevation factor, F. We compute some
of the intermediate variables from the field of view (fov) and the image
width with the following from [Ward97]:
num_one_degree_pixels = 2 * tan(fov * 0.5) * 180 / PI;
pixels_per_degree = width / num_one_degree_pixels;
cycles_per_degree = 0.5 * pixels_per_degree;
where fov is the horizontal field of view in radians and width is the
number of pixels across the screen.
The top level of the Laplacian pyramid, C(0), corresponds to frequencies
at cycles_per_degree and each level thereafter is half the frequency
of the preceding level. Typical values for fov and width are discussed
in the next section. The CSF is computed as a function of cycles per
degree (cpd) and luminance (lum):
/*
 * Contrast Sensitivity Function
 * from Barten SPIE 1989
 */
float csf(float cpd, float lum)
{
    // computes the contrast sensitivity function
    // given the cycles per degree (cpd) and luminance (lum)
    float a, b, result;
    a = 440 * pow(1 + 0.7 / lum, -0.2);
    b = 0.3 * pow(1 + 100.0 / lum, 0.15);
    result = a * cpd * exp(-b * cpd) * sqrt(1 + 0.06 * exp(b * cpd));
    return result;
}
We compute the constant csf_max = csf(3.248, 100.0) because the maximum
of the CSF occurs around 3.2 cycles per degree at 100 cd/m^2.
Then, the elevation factor due to frequency is:
F_freq[i] = csf_max / csf(cpd[i], 100)
where cpd[0] = cycles_per_degree, cpd[i] = 0.5 * cpd[i - 1], and i is an
integer from 0 to n_max - 1.
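A short sketch of this per-level computation, reusing the csf() function above (the function and array names are illustrative):

/*
 * Sketch: frequency elevation factors for each pyramid level.
 * cpd[] and F_freq[] must hold n_max entries.
 */
void frequency_elevation(float cycles_per_degree, int n_max,
                         float *cpd, float *F_freq)
{
    float csf_max = csf(3.248, 100.0);
    int i;
    cpd[0] = cycles_per_degree;
    for (i = 1; i < n_max; i++)
        cpd[i] = 0.5 * cpd[i - 1];
    for (i = 0; i < n_max; i++)
        F_freq[i] = csf_max / csf(cpd[i], 100.0);
}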
The masking function from [Daly93] is a function of contrast:
/*
 * Visual Masking Function
 * from Daly 1993
 */
float mask(float contrast)
{
    // computes the threshold elevation due to masking
    // given the contrast scaled by the contrast sensitivity
    float a, b, result;
    a = pow(392.498 * contrast, 0.7);
    b = pow(0.0153 * a, 4);
    result = pow(1 + b, 0.25);
    return result;
}
Then the elevation factor due to masking is:
F_mask[i] = mask(C[i] * csf(cpd[i], Y_adapt))
where Y_adapt is the average of pixels in the Y channel over a pixel
area of width pixels_per_degree.
The threshold elevation factor F is computed as:
F = ( sum over i = 0 to n_max - 1 of C[i] * F_freq[i] * F_mask[i] ) / ( sum over i of C[i] )
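Putting the elevation factors together, a per-pixel sketch of F might look like the following, reusing the csf() and mask() functions above; C[i] denotes the contrast pyramid value at the current pixel, and the guard for a zero denominator in flat regions is an implementation assumption:

/*
 * Sketch: combined threshold elevation factor F for one pixel.
 * C[i] is the contrast pyramid value at this pixel for level i,
 * cpd[] and F_freq[] come from frequency_elevation() above, and
 * y_adapt is the local adaptation luminance in cd/m^2.
 */
float threshold_elevation(const float *C, const float *cpd,
                          const float *F_freq, int n_max, float y_adapt)
{
    float num = 0.0, denom = 0.0;
    int i;
    for (i = 0; i < n_max; i++) {
        float F_mask = mask(C[i] * csf(cpd[i], y_adapt));
        num += C[i] * F_freq[i] * F_mask;
        denom += C[i];
    }
    /* guard against a zero denominator in completely flat regions */
    return (denom > 0.0) ? num / denom : 1.0;
}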
Finally, we perform the following two tests and mark the images as
different if either of the two tests fails. The first test is performed
on the luminance channel, Y. If the difference in luminance between two
corresponding pixels (x,y) in the reference and test images is
deltaY(x,y) = Y1(x,y) - Y2(x,y), then the luminance test fails if:
|deltaY(x,y)| > F * TVI(Y_adapt)
where TVI is the Threshold vs Intensity function found in [Ward97] and
the adaptation luminance, Y_adapt, is the average of pixels in a one
degree radius from the Y channel of the reference image. The TVI is
computed as follows:
/*
 * Given the adaptation luminance, this function returns the
 * threshold of visibility in cd per m^2
 * TVI means Threshold vs Intensity function
 * This version comes from Ward Larson Siggraph 1997
 */
float TVI(float adaptation_luminance)
{
    // returns the threshold luminance given the adaptation luminance
    // units are candelas per meter squared
    float log_a, r, result;
    log_a = log10(adaptation_luminance);
    if (log_a < -3.94) {
        r = -2.86;
    } else if (log_a < -1.44) {
        r = pow(0.405 * log_a + 1.6, 2.18) - 2.86;
    } else if (log_a < -0.0184) {
        r = log_a - 0.395;
    } else if (log_a < 1.9) {
        r = pow(0.249 * log_a + 0.65, 2.7) - 0.72;
    } else {
        r = log_a - 1.255;
    }
    result = pow(10.0, r);
    return result;
}
The second test is performed on the A and B channels of the reference
and test images. The color test fails if:
( (A_ref(x,y) - A_test(x,y))^2 + (B_ref(x,y) - B_test(x,y))^2 ) * color_scale^2 > F
color_scale is a scale factor that turns off the color test in the
mesopic and scotopic luminance ranges (night-time light levels) where
color vision starts to degrade. We use a value of one for adaptation
luminances greater than 10.0 cd/m^2 and then ramp color_scale linearly
to zero with decreasing adaptation luminance.
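A per-pixel sketch of the two tests, using the TVI() function above and fabsf() from <math.h>; the linear color_scale ramp from one at 10 cd/m^2 down to zero at zero luminance is one plausible reading of the description above, and the function name is illustrative:

/*
 * Sketch: per-pixel comparison.  Returns 1 if the pixel is perceptibly
 * different, 0 otherwise.  y1 and y2 are the luminances of the
 * reference and test pixels, (a1, b1) and (a2, b2) their CIE-L*A*B
 * color coordinates, F the threshold elevation factor, and y_adapt the
 * adaptation luminance at this pixel.
 */
int pixel_differs(float y1, float y2,
                  float a1, float b1, float a2, float b2,
                  float F, float y_adapt)
{
    float delta_y, color_scale, da, db;

    /* luminance test */
    delta_y = fabsf(y1 - y2);
    if (delta_y > F * TVI(y_adapt))
        return 1;

    /* color test; ramped off below 10 cd/m^2 (ramp shape is an assumption) */
    color_scale = (y_adapt >= 10.0f) ? 1.0f : y_adapt / 10.0f;
    da = a1 - a2;
    db = b1 - b2;
    if ((da * da + db * db) * color_scale * color_scale > F)
        return 1;

    return 0;
}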
Implementation Details
There were some implementation details that had to be taken into
account when using the perceptual error metric for Quality Assurance
testing of a production renderer.
First of all, the threshold
elevation factor, F, depends strongly on the frequency content of the
image. This in turn is affected by the viewing parameters of the observer,
the most important of which is the field of view. We measured a few
cinemas in Hollywood and found that the average front-row and back-row
fields of view were 85 degrees and 27 degrees, respectively. Using a
field of view of 85 degrees is the most conservative choice: if the
metric reports no visible difference under this setting, even a
front-row observer should not notice differences between the reference
and test images. Another important
factor is the width of the image in pixels. We use a value of 1827 for
film resolution images. The color_scale factor was added because the
perceptual metric was returning false positives in very dark areas where
the hue does not matter. It uses no rigorous perceptual data other than
the fact that the HVS loses its color sensitivity in the mesopic and
scotopic ranges.
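As a rough illustration of what these numbers imply, using the formulas
from the Implementation section: with fov = 85 degrees (about 1.48
radians) and width = 1827 pixels, num_one_degree_pixels is roughly 105,
giving pixels_per_degree of about 17.4 and cycles_per_degree of about
8.7. The 27 degree back-row viewer gives roughly 66 pixels per degree
and 33 cycles per degree. Because the front-row viewer sees far fewer
pixels per degree, pixel-scale differences fall at lower spatial
frequencies where the contrast sensitivity function is higher, which is
why the 85 degree setting is the more conservative choice.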
Application of Perceptual
Metric in Production Testing
Our studio pipeline is based
on a proprietary renderer and a suite of Unix command line utilities.
There are also many shaders written in C or C++ which are loaded as
dynamic shared objects (DSOs) at run time by the renderer. In order
to ensure that code changes do not change the look of previously rendered
scenes, the renderer, utilities and shaders are tested by an automated
batch process every night. An example of a single test would render
a simple scene, such as a teapot, under different lighting conditions.
Such a test might exercise, for example, the spot light shader and a noise
shader bound to the material of the teapot. A reference frame would
be rendered and stored as a canonical image. Then, when any code changes,
the new version of the renderer would be used to render the scene and the
resulting image compared against the canonical. This allows us to catch
inadvertent bugs introduced into the code. For example, suppose we change
the spot light shader so that shadows are calculated with a faster,
optimized technique. This would produce images that are not pixel identical
to the canonical but nevertheless correct. The perceptual metric allows
us to make such code changes without breaking the automated testing
process.
The batch testing suite produces
a mosaic of difference images for every test that fails. This mosaic
is simply the concatenation of images that differ from the stored canonical.
Using this image, the QA group is able to quickly narrow down problems
with the code. Prior to the development of the perceptual metric, so
many tests would fail that isolating the cause of failures was almost
impossible. By quickly screening out false positives, only visually
perceptible differences make it to the mosaic, speeding up problem identification.
During the production of Shrek
2, a new compiler was introduced in order to better optimize the renderer.
The application of the perceptual metric and automated test suite helped
us isolate the few shaders that were broken, perhaps due to numerical
instability or failures in compilation. It was essential in helping
production switch over to the new software with the assurance that nothing
in their existing shots would break.
When the studio recently switched
over to a new operating system, the test suite helped us isolate command
line tools that stopped working under the new OS or started behaving
differently on the new platform.
Acknowledgments
Thanks to Bill Seneschen for
the Hollywood cinema measurements and Paul Rademacher for proofreading.
Thanks also to the anonymous reviewers for their valuable feedback.
Bibliography
[Bart89] Peter G. J. Barten.
The Square Root Integral (SQRI): A new metric to describe the effect
of various display parameters on perceived image quality. In Human
Vision, Visual Processing, and Digital Display, volume 1077, pages
73-82. Proc. of SPIE, 1989.
[Burt83] Peter Burt and Edward
Adelson. The Laplacian pyramid as a compact image code. In IEEE Transactions
on Communications, Vol. Com-31, No. 4, April 1983.
[Daly93] Scott Daly. The visible
differences predictor: an algorithm for the assessment of image fidelity.
In Digital Images and Human Vision, pages 179-206, MIT Press,
Cambridge, MA, 1993.
[Glas95] Andrew S. Glassner.
In Principles of digital image synthesis, pages 59-66, Morgan
Kaufmann Publishers, San Francisco, CA, 1995.
[Lubi95] Jeffrey Lubin. A
visual discrimination model for imaging system design and evaluation.
In Vision models for target detection and recognition, pages
245-283. World Scientific, New Jersey. 1995.
[Rama99] Mahesh Ramasubramanian,
Sumanta N. Pattanaik, and Donald P. Greenberg. A perceptually based physical
error metric for realistic image synthesis. In SIGGRAPH 99 Conference
Proceedings, pages 73-82, Los Angeles, CA, 1999.
[Ward97] Gregory Ward Larson, Holly Rushmeier, and Christine Piatko. A visibility matching tone reproduction operator for high dynamic range scenes. In IEEE Transactions on Visualization and Computer Graphics, 3(4):291-306, October 1997.