ATI focused on distributing the load among the various functional units of the chip, so the new RADEON X1000 architecture is multi-threaded or, in ATI’s terms, an Ultra-Threaded Architecture. The name sounds like Intel’s Hyper-Threading, and the purpose is similar, too: to use the processor’s computational resources as efficiently as possible and to minimize the time its execution units spend idle.
The RADEON X1000 (R5xx) architecture has some points of similarity with both RADEON 9000 (R3xx) and RADEON X800 (R4xx) architectures as well as with the completely new architecture employed in the Xbox 360 GPU, but ATI’s new processors have a number of unique traits that have no analogs in the other chips.
For example, the RADEON X1000 GPUs have an integrated intelligent switching unit, the so-called Ultra-Threading Dispatch Processor, whose job is to distribute the load optimally among the quads of pixel processors (each quad consists of four pixel processors, each of which can process a shader for a 2x2-pixel block in a single clock cycle) and the texture-mapping units. In particular, the Ultra-Threading Dispatch Processor divides the pixel processing workload into small threads of 4x4 pixels. It can also detect when some pixel processors in the quads are idle and assign them new tasks. When a shader cannot continue because the data it needs is not ready yet, the arbiter halts the thread until the data arrives. This frees the ALUs for other threads and hides latencies such as that of texture sampling, whether the texture is stored in the cache or in memory. According to ATI, this architecture keeps the pixel processors about 90% efficient on any shader.
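The latency-hiding idea can be illustrated with a toy model. The sketch below is not ATI’s scheduling algorithm: it assumes a single ALU, threads that alternate a short ALU burst with a texture fetch, and an arbiter that always runs the thread whose data has been ready the longest. The function name and the values of `alu_cycles`, `fetch_latency` and `rounds` are made up for the illustration.

```python
def alu_utilization(num_threads, alu_cycles=4, fetch_latency=20, rounds=10):
    """Fraction of cycles the single modeled ALU is busy (toy model)."""
    ready_at = [0] * num_threads        # cycle at which each thread's data arrives
    remaining = [rounds] * num_threads  # ALU bursts each thread still has to run
    cycle = 0
    busy = 0
    while any(remaining):
        # Threads that still have work and are not waiting for a texture fetch
        runnable = [i for i in range(num_threads)
                    if remaining[i] and ready_at[i] <= cycle]
        if not runnable:
            # Every thread is waiting: the ALU idles until the earliest wake-up
            cycle = min(ready_at[i] for i in range(num_threads) if remaining[i])
            continue
        # Assumed arbiter policy: run the thread that has been ready the longest
        t = min(runnable, key=lambda i: ready_at[i])
        cycle += alu_cycles
        busy += alu_cycles
        remaining[t] -= 1
        # The thread now issues a fetch and sleeps until the data comes back
        ready_at[t] = cycle + fetch_latency
    return busy / cycle


for n in (1, 3, 6):
    print(n, "threads:", round(alu_utilization(n), 2))
```

In this model a single thread leaves the ALU idle most of the time, while six threads are enough to cover a 20-cycle fetch with 4-cycle bursts. A real GPU also has to store the state of every thread in flight, which is where the register array comes in.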
Quick switching between threads requires storing the intermediate data of each thread, and for this ATI uses special registers (the General Purpose Register Array) that have a high-speed connection to the pixel processors, as in ATI’s earlier GPUs. It is not quite clear yet how many registers the RADEON X1800, X1600 and X1300 have, or how sensitive the GPUs are to pixel shader complexity.
Complying fully with the Shader Model 3.0 standard, ATI’s new solutions support loops, branches and subroutines. Flow control lets them execute shaders of virtually unlimited length. The RADEON X1000 family processors perform all calculations in 128-bit floating-point format (32 bits per component), which minimizes the chance that round-off errors accumulate and degrade the image quality.
The number of simultaneously executed code threads has grown, but the size of each thread has been reduced to 4x4 pixels. This helps to achieve a higher efficiency at dynamic branching, as illustrated by the next diagram:
The advantages of ATI’s approach are obvious: the efficiency of dynamic branching degrades sharply as threads get larger, and branching becomes downright unprofitable with 64x64-pixel threads. The senior model, RADEON X1800 (R520), can execute up to 512 threads of shader code simultaneously, while the weaker models are limited to 128 simultaneous threads.
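Why thread size matters for dynamic branching can be shown with a rough model. The sketch below rests on assumptions of our own rather than on ATI’s data: all pixels of a thread execute in lockstep, so a thread whose pixels disagree on the branch pays for both paths. The branch condition (a disc in a 256x256 image) and the path costs `cheap` and `costly` are invented for the illustration.

```python
def branch_efficiency(tile, size=256, radius=100, cheap=1, costly=10):
    """Ideal per-pixel cost divided by the cost when each tile runs in lockstep."""
    c = size / 2

    def takes_costly_path(x, y):
        # Hypothetical branch condition: the pixel centre lies inside a disc
        return (x + 0.5 - c) ** 2 + (y + 0.5 - c) ** 2 < radius ** 2

    ideal = 0   # cost if every pixel could branch on its own
    actual = 0  # cost when a divergent tile must execute both paths
    for ty in range(0, size, tile):
        for tx in range(0, size, tile):
            flags = [takes_costly_path(x, y)
                     for y in range(ty, ty + tile)
                     for x in range(tx, tx + tile)]
            n = len(flags)
            n_costly = sum(flags)
            ideal += n_costly * costly + (n - n_costly) * cheap
            if n_costly == n:
                actual += n * costly
            elif n_costly == 0:
                actual += n * cheap
            else:
                # Divergent tile: every pixel pays for both paths
                actual += n * (cheap + costly)
    return ideal / actual


for tile in (4, 16, 64):
    print(f"{tile}x{tile}-pixel threads: efficiency {branch_efficiency(tile):.2f}")
```

Under these assumptions the efficiency drops as the tile grows, because a larger share of the tiles straddles the edge of the disc. The absolute numbers depend entirely on the chosen scene and costs and say nothing about real shaders.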
A dedicated branch execution unit is another interesting feature of the RADEON X1000. Executing one flow control instruction (conditions, loops, subroutines) per clock cycle, this unit greatly reduces the load on the main ALUs, so shaders that use flow control instructions take fewer cycles than they otherwise would. This may give the new GPUs a considerable performance advantage over NVIDIA’s solutions with version 3.0 pixel shaders.
Since contemporary games make wide use of pixel shaders, ATI put an emphasis on the pixel shader performance of the new GPUs. As you remember, the pixel pipelines of the GeForce 7 were likewise improved over NVIDIA’s earlier GPUs.
The goal was achieved by increasing the number of ALUs. Each pixel processor of the R520 has two scalar and two vector ALUs and can execute up to four instructions per clock cycle (two ADD-type instructions with a modifier plus two ADD/MUL/MADD-type instructions).
The new RADEON is also the first GPU in which the texture-mapping units and the texture addressing units communicate with the shader processors not directly, but through the Ultra-Threading Dispatch Processor. This must have been done as another optimization measure for the whole graphics core, mostly to hide the texture sampling latency. It’s just more efficient to coordinate all the units from a single “control center”!
ATI Technologies says the total performance of the RADEON X1800 (R520) equals 83 Gflops, whereas NVIDIA claims 165 Gflops for the G70. NVIDIA’s figure is twice as high as ATI’s, but the comparison is probably not a valid one. The speed of the GeForce 7800 GTX was measured on MADD instructions, and we don’t know how ATI measured the performance of their card.
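To see why the measuring method matters, consider how a peak figure is usually put together: the number of ALUs times the vector width times the floating-point operations counted per instruction times the clock rate. The sketch below uses round, made-up inputs that are not the specifications of any real chip; the function and parameter names are ours, too. The only point is that counting a MADD as two operations (a multiply and an add) doubles the result for the very same hardware.

```python
def peak_gflops(alus, lanes, flops_per_instruction, clock_ghz):
    """Theoretical peak: every lane of every ALU retires one instruction per clock."""
    return alus * lanes * flops_per_instruction * clock_ghz


# Round, made-up figures, NOT the specification of any real GPU
hypothetical_chip = dict(alus=10, lanes=4, clock_ghz=0.5)

# Same hardware, two counting conventions
as_single_op = peak_gflops(flops_per_instruction=1, **hypothetical_chip)
as_madd = peak_gflops(flops_per_instruction=2, **hypothetical_chip)

print(as_single_op, as_madd)
```

Two vendors who quote Gflops under different conventions can therefore differ by a factor of two before any real architectural difference comes into play.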