GeForce 8800 in Detail
Nvidia’s GeForce 8800 sticks even closer to the unification ideology than the ATI Xenos. The heart of the new chip is a universal execution core that consists of 128 separate processors. This core works at a considerably higher clock rate than the rest of the G80’s subunits.
The stream processors are grouped into 8 blocks of 16 processors each, and each block is equipped with 4 texture modules and a shared L1 cache. A block consists of two shader processors (each of which consists of 8 stream processors), and all eight blocks have access to any of the six L2 caches and to any of the six arrays of general-purpose registers. Thus, data processed by one shader processor can be used by another shader processor.
Importantly, this arrangement of shader processors, caches and general-purpose registers makes it possible to disable defective parts of the chip: either individual shader blocks or a unit made up of an L2 cache block, its general-purpose registers and a 64-bit memory controller. Chips with such manufacturing defects can then be turned into “cut-down” solutions to be sold at a lower price.
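The unit counts in this description can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures given above; the constant names are ours and purely illustrative.

```python
# Unit counts as described in the text; the names are illustrative only.
BLOCKS = 8
SHADER_PROCESSORS_PER_BLOCK = 2
STREAM_PROCESSORS_PER_SHADER_PROCESSOR = 8
TEXTURE_MODULES_PER_BLOCK = 4

# Two shader processors of 8 stream processors each make up one block.
stream_processors_per_block = (
    SHADER_PROCESSORS_PER_BLOCK * STREAM_PROCESSORS_PER_SHADER_PROCESSOR
)
total_stream_processors = BLOCKS * stream_processors_per_block
total_texture_modules = BLOCKS * TEXTURE_MODULES_PER_BLOCK

print(stream_processors_per_block)  # 16
print(total_stream_processors)      # 128
print(total_texture_modules)        # 32
```

The totals agree with the 128 stream processors mentioned earlier and give 32 texture modules for the whole chip.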
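To see what disabling units means in numbers, here is a minimal sketch. It assumes that each of the six L2 cache blocks is paired with one 64-bit memory controller and that the total memory bus width is simply the sum of the active controllers. The function name cut_down and the example configuration are hypothetical and do not describe any announced product.

```python
# Figures from the text: 8 shader blocks of 16 stream processors each and
# six memory partitions, each with a 64-bit memory controller.
FULL_SHADER_BLOCKS = 8
STREAM_PROCESSORS_PER_BLOCK = 16
FULL_MEMORY_PARTITIONS = 6
BUS_BITS_PER_PARTITION = 64


def cut_down(disabled_shader_blocks, disabled_memory_partitions):
    """Return (stream processors, memory bus width in bits) left on the chip.

    Assumption: the bus width is the sum of the active 64-bit controllers.
    """
    if not 0 <= disabled_shader_blocks <= FULL_SHADER_BLOCKS:
        raise ValueError("invalid number of disabled shader blocks")
    if not 0 <= disabled_memory_partitions <= FULL_MEMORY_PARTITIONS:
        raise ValueError("invalid number of disabled memory partitions")
    active_blocks = FULL_SHADER_BLOCKS - disabled_shader_blocks
    active_partitions = FULL_MEMORY_PARTITIONS - disabled_memory_partitions
    return (
        active_blocks * STREAM_PROCESSORS_PER_BLOCK,
        active_partitions * BUS_BITS_PER_PARTITION,
    )


print(cut_down(0, 0))  # (128, 384): the full chip
print(cut_down(2, 1))  # (96, 320): a hypothetical cut-down configuration
```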
The data is converted into FP32 format by the Input Assembler. The Thread Processor distributes branches of code and optimizes load on the stream processors.
The GigaThread technology is a more advanced analog of the Ultra-Threading technology which ATI employs in its Radeon X1000 series. GigaThread allots shader blocks for processing vertex, geometry and pixel shaders depending on the overall load. Shaders of all types can be executed simultaneously when that is necessary and possible. The GigaThread processor also tries to minimize the idle time of the G80’s shader blocks while texture sampling operations are being performed.
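Nvidia has not disclosed how GigaThread actually divides the blocks, so the following is only a toy illustration of load-dependent allotment and not the real algorithm. It assumes the pending work of each shader type can be expressed as a single number and splits the 8 shader blocks in proportion to it. The function allot_blocks and the workload figures are hypothetical.

```python
def allot_blocks(pending_work, total_blocks=8):
    """Split shader blocks among shader types in proportion to pending work.

    Toy model only: fractional shares are rounded with the largest-remainder
    method so that the allotted blocks always add up to total_blocks.
    """
    total_work = sum(pending_work.values())
    if total_work == 0:
        return {kind: 0 for kind in pending_work}
    exact = {
        kind: total_blocks * work / total_work
        for kind, work in pending_work.items()
    }
    shares = {kind: int(value) for kind, value in exact.items()}
    leftover = total_blocks - sum(shares.values())
    # Hand the remaining blocks to the types with the largest fractional part.
    by_fraction = sorted(
        pending_work, key=lambda kind: exact[kind] - shares[kind], reverse=True
    )
    for kind in by_fraction[:leftover]:
        shares[kind] += 1
    return shares


# A pixel-heavy frame versus an evenly balanced one (made-up workloads).
print(allot_blocks({"vertex": 10, "geometry": 0, "pixel": 90}))
print(allot_blocks({"vertex": 50, "geometry": 0, "pixel": 50}))
```

In this model a pixel-heavy workload leaves seven blocks to pixel shaders and one to vertex shaders, while a balanced workload splits the blocks evenly. A fixed, non-unified design could not shift units between the two cases.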
Each stream processor can perform two simultaneously issued scalar operations, such as MAD+MUL, per cycle, and the overall computing power of the core is, according to Nvidia, about 520 gigaflops. This is over twice that of the ATI R580, whose performance, according to ATI, is about 250 gigaflops. We can make one interesting and perhaps arguable observation here. Each pixel processor in the R580 is known to have 2 scalar and 2 vector ALUs and a branch execution unit, so it can execute up to 4 arithmetic instructions per cycle plus one branch instruction. It seems that one stream processor in the G80 is less efficient than one pixel processor in the R580, but the overall performance of the G80 is higher because it has more execution units (128 against 48) and clocks them at a higher frequency. Unfortunately, we don’t have any data about the design of an individual stream processor in the G80. We only know that it is fully scalar, as opposed to the pixel processors of last-generation architectures, which contain both scalar and vector ALUs.
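The figure of about 520 gigaflops can be reproduced with simple arithmetic. The sketch below counts a MAD as two floating-point operations and a MUL as one, and assumes a shader clock of 1.35 GHz, which is the GeForce 8800 GTX value and is not stated in the paragraph above. The 250 gigaflops for the R580 is the vendor figure quoted in the text and is not derived here.

```python
STREAM_PROCESSORS = 128
FLOPS_PER_MAD = 2  # a multiply-add counts as two floating-point operations
FLOPS_PER_MUL = 1
SHADER_CLOCK_GHZ = 1.35  # assumption: GeForce 8800 GTX shader clock
R580_GFLOPS = 250.0      # vendor figure quoted in the text


def peak_gflops(units, flops_per_cycle, clock_ghz):
    """Theoretical peak: units x operations per cycle x clock rate in GHz."""
    return units * flops_per_cycle * clock_ghz


g80_gflops = peak_gflops(
    STREAM_PROCESSORS, FLOPS_PER_MAD + FLOPS_PER_MUL, SHADER_CLOCK_GHZ
)
print(f"G80 peak: {g80_gflops:.1f} gigaflops")               # 518.4
print(f"Ratio to the R580: {g80_gflops / R580_GFLOPS:.2f}")  # 2.07
```

This is a theoretical peak that assumes every stream processor issues a MAD and a MUL on every cycle, so real shader code will fall short of it.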
Each of the G80’s 128 stream processors is an ordinary ALU capable of processing data in floating-point format. It means that a stream processor can not only work with shaders of any type (vertex, pixel, geometry) but also perform physics calculations or other computations in the framework of the Compute Unified Device Architecture (CUDA). And it does that independently of the other processors. In other words, one part of the GeForce 8800 can be busy with some kind of computations while another part, for example, visualizes the results of those computations, because the streaming architecture allows using the output of one processor as the input for another processor.
The GPU’s efficiency at processing shaders with dynamic branching has been improved in comparison with the ATI Radeon X1900. The latter handles branches with a granularity of 48 pixels, whereas the GeForce 8800 works with a granularity of 16 to 32 pixels. We will check how efficient the execution of branching pixel shaders has become in the theoretical tests section.
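Why a smaller branching granularity helps can be shown with a simplified model. It assumes that pixels are processed in batches of a fixed size, and that a batch whose pixels disagree on the branch has to execute both code paths for every pixel. The one-dimensional row of pixels, the costs of the two paths and the function batch_cost are made up for illustration. Real hardware works on two-dimensional groups of pixels, so the numbers only show the trend.

```python
def batch_cost(takes_expensive_path, granularity, cheap=1, expensive=10):
    """Total cost of shading a row of pixels in batches of `granularity`.

    Simplified model: a batch in which all pixels agree pays for one path
    only, while a mixed batch pays for both paths on every pixel.
    """
    total = 0
    for start in range(0, len(takes_expensive_path), granularity):
        batch = takes_expensive_path[start:start + granularity]
        if all(batch):
            cost_per_pixel = expensive
        elif not any(batch):
            cost_per_pixel = cheap
        else:
            cost_per_pixel = cheap + expensive
        total += len(batch) * cost_per_pixel
    return total


# 192 pixels: the first 100 take the cheap path, the rest the expensive one.
pixels = [index >= 100 for index in range(192)]

ideal = sum(10 if flag else 1 for flag in pixels)
print("ideal:", ideal)  # 1020
for granularity in (16, 32, 48):
    print(granularity, batch_cost(pixels, granularity))  # 1072, 1088, 1104
```

With a single branch boundary the difference is small, but every additional boundary adds one more mixed batch, and a mixed batch of 48 pixels wastes three times as much work as a mixed batch of 16.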