Unified Shader Architecture – ATI’s Vision
Developers of modern GPUs aim to create a scalable architecture that can be expanded or cut down to produce GPUs for different price segments.
The ATI Radeon HD 2000 family is no exception. Judging by the block diagram of the R600 chip (Radeon HD 2900), the new series resembles the Radeon X1800/X1900 in configuration. The similarity extends to the ring-bus memory controller, first introduced in October 2005.
The heart of the new chip is a dynamic ultra-threaded dispatch processor that, according to ATI, is capable of dispatching thousands of tasks. This capability is needed because the new dispatcher has to manage many more computing resources and data types.
Stream processors that execute vertex, geometry and pixel shaders are organized as 4 SIMD units of 16 shader processors each. Every shader processor incorporates 5 scalar ALUs, each capable of executing one FP MAD instruction per clock cycle, and one ALU out of the five can also execute instructions like SIN, COS, LOG, EXP, etc. This gives 4 × 16 × 5 = 320 scalar ALUs in total.
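The arithmetic behind the shader core can be checked with a short sketch. The unit counts are taken from the description above. The constant and function names are made up for illustration, counting one MAD as two floating-point operations is an assumption, and the core clock is left as a parameter because it is not stated here.

```python
# Unit counts taken from the description of the R600 shader core.
SIMD_UNITS = 4
SHADER_PROCESSORS_PER_SIMD = 16
ALUS_PER_SHADER_PROCESSOR = 5

# Assumption: one multiply-add (MAD) counts as two floating-point operations.
FLOPS_PER_MAD = 2


def total_scalar_alus() -> int:
    """Number of scalar ALUs across all SIMD units."""
    return SIMD_UNITS * SHADER_PROCESSORS_PER_SIMD * ALUS_PER_SHADER_PROCESSOR


def transcendental_capable_alus() -> int:
    """One ALU per shader processor can also run SIN, COS, LOG, EXP, etc."""
    return SIMD_UNITS * SHADER_PROCESSORS_PER_SIMD


def peak_mad_gflops(core_clock_mhz: float) -> float:
    """Theoretical peak MAD throughput in GFLOPS for a given core clock.

    Assumes every ALU issues one MAD on every clock cycle, which real
    shader code will not always achieve.
    """
    return total_scalar_alus() * FLOPS_PER_MAD * core_clock_mhz / 1000.0


if __name__ == "__main__":
    print("scalar ALUs:", total_scalar_alus())
    print("ALUs with transcendental support:", transcendental_capable_alus())
    # The clock value below is only a placeholder for illustration.
    print("peak GFLOPS at 1000 MHz:", peak_mad_gflops(1000.0))
```

The peak figure is an upper bound only. It says nothing about how well the dispatcher keeps all five ALUs of a shader processor busy.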
Texture processors are located outside the execution pipeline, as in the Radeon X1000 architecture, and are designed as 4 large blocks, each of which includes 8 texture address processors, 4 texture filter units, and 20 texture samplers. None of the blocks has its own cache; they all share unified L1, L2 and vertex caches.
The R600’s render back-ends are 4 rather complex processors that perform typical raster operations such as blending and antialiasing.
Thus, we can identify a few specific aspects of ATI’s approach to designing the Radeon HD 2000 chips:
- Instead of strictly tying the number of stream processors to the number of texture processors and memory controller channels, as in the Nvidia G80, ATI sets the number of execution units of each type independently. On the one hand, this provides more flexibility, since the internal resources of new GPUs can be adjusted without considerable “external” changes to the chip configuration. On the other hand, the Nvidia G80, with its rigidly set number of computing processors, texture-mapping units and raster operators, proves to be a well-balanced solution that can be easily adjusted to market requirements for the entire lifecycle of the G8x architecture, i.e. until the middle or even the end of 2009.
- “Colossal” texture processors and raster back-ends with rich functionality and a rigidly set number of modules. These may be good for building powerful GPUs, but may backfire in simpler GPUs: two or three such processors may be underutilized by the application yet take up a considerable share of the transistor budget, while another proves to be a bottleneck. Then again, making the modules inside the texture processors flexibly adjustable would have made the dynamic dispatcher even more complex, which would also be a blow to the transistor budget of mainstream and low-end GPU chips.
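The trade-off described in these two points can be illustrated with a rough sketch in which the numbers of SIMD units, texture blocks and render back-ends are chosen independently. The R600 figures come from the text above. The class, its field names, and the cut-down configuration are hypothetical and do not describe any real product. The sketch also assumes that the per-unit counts (16 shader processors per SIMD unit, 4 filter units per texture block) stay the same when the chip is scaled down, which need not hold in practice.

```python
from dataclasses import dataclass

# Per-unit counts from the R600 description. Assumed unchanged when scaling.
SHADER_PROCESSORS_PER_SIMD = 16
ALUS_PER_SHADER_PROCESSOR = 5
FILTER_UNITS_PER_TEXTURE_BLOCK = 4


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical description of a chip with independently chosen unit counts."""
    name: str
    simd_units: int
    texture_blocks: int
    render_backends: int

    @property
    def scalar_alus(self) -> int:
        return (self.simd_units * SHADER_PROCESSORS_PER_SIMD
                * ALUS_PER_SHADER_PROCESSOR)

    @property
    def texture_filter_units(self) -> int:
        return self.texture_blocks * FILTER_UNITS_PER_TEXTURE_BLOCK

    @property
    def alus_per_filter_unit(self) -> float:
        """Ratio of shader ALUs to texture filter units."""
        return self.scalar_alus / self.texture_filter_units


# Configuration described in the text.
R600 = GpuConfig("R600", simd_units=4, texture_blocks=4, render_backends=4)

# Hypothetical cut-down part, NOT a real product configuration.
CUT_DOWN = GpuConfig("hypothetical cut-down chip",
                     simd_units=2, texture_blocks=1, render_backends=1)

if __name__ == "__main__":
    for cfg in (R600, CUT_DOWN):
        print(cfg.name, cfg.scalar_alus, cfg.texture_filter_units,
              cfg.alus_per_filter_unit)
```

Because a texture block can only be added or removed whole, the ratio of ALUs to filter units moves in coarse steps as the chip is scaled. In this made-up example, halving the SIMD units while keeping a single texture block doubles the ratio, which is the kind of imbalance the second point warns about.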