GeForce 6800/6800 Ultra Rendering Pipeline
You can’t talk about traditional pixel pipelines with regard to the NV40 core and the CineFX 3.0 engine. Starting with the NV30 and CineFX, NVIDIA’s graphics processors have had one wide pixel pipeline that holds several pixels at its different stages at any given time, rather than several independent pixel pipelines. The NV40 inherits the NV3x architecture, but with major improvements and additions.
The NV40 has a pixel pipeline four times wider than that of the NV35. There can now be as many as 16 pixels in the pipeline simultaneously, and the maximum output rate is 16 pixels per clock cycle. When more than one texture is mapped, the pixel output rate is lower: for example, 8 pixels per clock cycle with two textures. Like that of the NV35, the pixel pipeline of the NV40 accelerates when working with the Z-buffer or the stencil buffer, and the overall four-fold performance increase is felt here, too: the graphics processor can now output 32 Z-values per clock cycle. Thus, the NV40 is simply boiling with raw power – we can speak of a quadruple fill-rate increase compared to the NV35.
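The per-clock figures above can be put into a short sketch. The values 16, 8 and 32 come from the text; the idea that the color output rate keeps falling in inverse proportion to the texture count beyond two textures is our assumption, and the function name is hypothetical.

```python
# Peak per-clock output of the NV40 pixel pipeline, using the figures quoted in the text.
PIPELINE_WIDTH = 16      # max pixels per clock with one texture (from the text)
Z_ONLY_PER_CLOCK = 32    # Z-values per clock in Z/stencil mode (from the text)


def pixels_per_clock(textures: int) -> float:
    """Peak color pixels per clock for a given number of textures per pixel.

    ASSUMPTION: the rate is inversely proportional to the texture count.
    The text only states the 1-texture (16) and 2-texture (8) cases.
    """
    if textures < 1:
        raise ValueError("textures must be at least 1")
    return PIPELINE_WIDTH / textures


for n in (1, 2):
    print(f"{n} texture(s): {pixels_per_clock(n):g} pixels/clock")
print(f"Z/stencil only: {Z_ONLY_PER_CLOCK} values/clock")
```

Multiplying any of these per-clock values by the core clock frequency gives the corresponding theoretical fill rate.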
While widening the pixel pipeline, NVIDIA also improved the computational capability of the pixel processor. Firstly, the number of available temporary registers – the weak side of pixel processors in GeForce FX chips because of the peculiarities of the pixel pipeline architecture – has been increased. Complex pixel shaders calculated with full 32-bit precision should no longer be an unbearable load for the GPU.
Secondly, the number of fully-operational ALUs for processing pixel components seems to have doubled in the NV40. To be precise, the two types of floating-point ALUs in the NV35, “fully-operational” and “simplified”, which replaced the integer ALUs of the NV30, have all become “fully-operational” ALUs that perform operations of any complexity at the same speed. This is how NVIDIA shows the advantages of CineFX 3.0 (right), with its doubled number of ALUs, over “traditional” architectures (left):
Thus, the ALUs of the NV40 can perform up to eight operations on the components of one pixel in a clock cycle. Considering that the NV40 pipeline processes 16 pixels at a time, the NV40 core has 32 floating-point ALUs that perform up to 128 operations per clock cycle.
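The arithmetic behind these figures can be checked in a few lines. The three input numbers are taken from the text; the per-pixel and per-ALU values are simply derived from them, and reading the four operations per ALU as one per component of a four-component vector is our inference rather than something the text states.

```python
# Figures quoted in the text.
PIXELS_IN_FLIGHT = 16   # pixels processed at a time
OPS_PER_PIXEL = 8       # operations on one pixel's components per clock
TOTAL_ALUS = 32         # floating-point ALUs in the core

# Derived values.
alus_per_pixel = TOTAL_ALUS // PIXELS_IN_FLIGHT       # 32 / 16 = 2
ops_per_alu = OPS_PER_PIXEL // alus_per_pixel         # 8 / 2 = 4
total_ops_per_clock = PIXELS_IN_FLIGHT * OPS_PER_PIXEL  # 16 * 8 = 128

print("ALUs per pixel:", alus_per_pixel)
print("Operations per ALU per clock:", ops_per_alu)
print("Operations per clock, whole core:", total_ops_per_clock)
```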
Overall, the NV40 seems to be an evolutionary development of the NV3x series, with considerable improvements. Besides “quantitative” improvements, there are “qualitative” ones, too.
Pixel Shaders Version 3.0
Support for pixel shaders version 3.0 in the NV40 means, first and foremost, support for dynamic loops and branching. The decision about which branch to execute is now made during the actual execution, because the variables that determine the flow of the shader may vary. They are no longer constants, as they were with static branches and loops.
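The difference between static and dynamic branching can be illustrated with a small sketch. This is plain Python standing in for shader code, not real shader syntax; the function names, the two "paths" and the brightness values are all made up for the illustration.

```python
def shade_static(pixels, use_bright_path):
    """Static branch: the condition is a constant fixed before the shader
    runs, so every pixel in the draw takes the same path."""
    if use_bright_path:
        return [min(p * 2.0, 1.0) for p in pixels]
    return [p * 0.5 for p in pixels]


def shade_dynamic(pixels, threshold):
    """Dynamic branch: the condition is evaluated for each pixel during
    execution, so neighboring pixels may take different paths."""
    out = []
    for p in pixels:
        if p > threshold:          # depends on the pixel's own value
            out.append(min(p * 2.0, 1.0))
        else:
            out.append(p * 0.5)
    return out


pixels = [0.25, 0.5, 0.75]
print(shade_static(pixels, True))
print(shade_dynamic(pixels, 0.5))
```

In the static case the path is known before any pixel is touched; in the dynamic case it is only known once each pixel's own data has been examined.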
Clearly, the new functionality of the NV40 won’t show itself when running version 2.0 shaders, where only the speed characteristics of the pixel processor matter (hopefully, the processor is fast enough that we won’t see the slow shader processing of the NV3x again).
There is one problem that can arise with version 3.0 shaders that use dynamic loops and branches. While processing several pixels at once, the NV40 may encounter a situation in which it must execute one branch of the shader for some pixels and another branch for the others. How does the pipeline work in this case?
Every possible solution takes a toll on performance. For example, if the pipeline meets a branch and starts processing pixels one by one, rather than several pixels at a time, the execution units will mostly be idle. Conversely, if both branches of the shader are executed for all pixels, extra computational cost is incurred.
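The trade-off between the two approaches just described can be sketched with a toy cost model. This is not a description of how the NV40 actually handles divergent branches, which remains an open question here; the function names, the cycle costs and the 12/4 split of pixels are hypothetical values chosen only to make the comparison concrete.

```python
def one_by_one_cost(taken_a, cost_a, cost_b):
    """Pixels are processed one at a time: each pixel pays only for the
    branch it takes, but the pixels no longer run in parallel."""
    return sum(cost_a if t else cost_b for t in taken_a)


def both_branches_cost(taken_a, cost_a, cost_b):
    """All pixels run through both branches together; each pixel keeps
    only the result of the branch it needed, the other work is wasted."""
    return cost_a + cost_b


# Hypothetical batch: 16 pixels, 12 take branch A, 4 take branch B.
taken_a = [True] * 12 + [False] * 4
COST_A, COST_B = 10, 4   # made-up cycle counts per branch

print("one by one:   ", one_by_one_cost(taken_a, COST_A, COST_B), "cycles")
print("both branches:", both_branches_cost(taken_a, COST_A, COST_B), "cycles")
```

With these made-up numbers the one-by-one approach takes 136 cycles and the both-branches approach takes 14, against 10 cycles for a batch in which every pixel takes branch A. Either way the divergent batch costs more than a coherent one, and the balance between the two approaches shifts with the batch size and the relative length of the branches.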
We tried to estimate the performance of the NVIDIA GeForce 6800/6800 Ultra when it executes version 3.0 pixel shaders using our own benchmark, but for some reason we did not succeed. The Microsoft DirectX 9 API has supported pixel shaders 3.0 since its first release, but the GeForce 6800 Ultra did not run the shader we offered it. We believe the Shader 3.0 functionality is at least partially disabled in the current driver.