Stream Processors and Texture Units
The general design of the computing section has not changed much in the RV870. It is still based on superscalar shader processors, each incorporating five ALUs: four general-purpose ALUs and a fifth, special-purpose ALU capable of executing complex instructions such as SIN, COS, LOG and EXP. Besides the ALUs, each shader processor contains a branch control unit and an array of general-purpose registers.
When we talk about 1600 stream processors in the RV870, we must keep in mind that these are actually 320 rather complex computing units with five ALUs each. Given sufficiently well-optimized code, this design of the GPU’s computing section can achieve a much higher level of performance than Nvidia’s scalar architecture. The design of the shader processors and the task scheduler has been improved in the new GPU to support the new DirectX 11/DirectCompute 11 capabilities.
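To illustrate why code optimization matters so much for a five-ALU design, here is a minimal sketch of how instructions might be packed into five-slot bundles. It is a simplified model and not ATI's actual shader compiler: the function names `pack_bundles` and `utilization` are hypothetical, every operation is assumed to fit any slot (the special-purpose ALU is not treated differently), and an operation can only be issued after all operations it depends on have completed in an earlier bundle.

```python
def pack_bundles(ops, width=5):
    """Greedily pack operations into bundles of at most `width` slots.

    `ops` is a list of (name, deps) pairs in program order. An operation
    may join the current bundle only if every dependency was issued in an
    EARLIER bundle (simplified model, hypothetical helper).
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for name, deps in remaining:
            # `done` is only updated after the loop, so operations placed
            # in the same bundle never depend on each other.
            if len(bundle) < width and all(d in done for d in deps):
                bundle.append(name)
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(bundle)
        remaining = [(n, d) for n, d in remaining if n not in done]
        bundles.append(bundle)
    return bundles


def utilization(ops, width=5):
    """Fraction of ALU slots that carry useful work."""
    bundles = pack_bundles(ops, width)
    return len(ops) / (len(bundles) * width)


# Five independent operations fill one bundle completely.
independent = [(f"op{i}", []) for i in range(5)]

# Five operations that each depend on the previous one need five bundles.
chain = [("op0", [])] + [(f"op{i}", [f"op{i - 1}"]) for i in range(1, 5)]

print(utilization(independent))  # all five slots busy
print(utilization(chain))        # one slot busy per bundle
```

In this model the fully independent code keeps all five ALUs of a processor busy, while the dependent chain uses only one of the five, which is the case where a scalar architecture loses nothing by comparison.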
As before, the RV870’s shader processors are grouped into SIMD cores with 16 processors in each core, but there are now 20 rather than 10 such cores in the GPU.
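The unit counts above follow from simple multiplication. The short sketch below reproduces them using only the figures quoted in the text; the helper name `count_units` is ours and purely illustrative.

```python
# Figures quoted in the text.
PROCESSORS_PER_CORE = 16   # shader processors in each SIMD core
ALUS_PER_PROCESSOR = 5     # four general-purpose ALUs plus one special-purpose ALU


def count_units(simd_cores):
    """Return (shader processors, ALUs) for a GPU with `simd_cores` SIMD cores."""
    processors = simd_cores * PROCESSORS_PER_CORE
    alus = processors * ALUS_PER_PROCESSOR
    return processors, alus


rv870 = count_units(20)      # the RV870 has 20 SIMD cores
previous = count_units(10)   # the earlier design had 10

print("RV870:   ", rv870)    # (320, 1600)
print("Previous:", previous) # (160, 800)
```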
Each core is serviced by dedicated logic and has four texture processors and an L1 cache at its disposal. Thus, the number of texture processors in the RV870 has doubled (from 40 to 80 TMUs), and so has the peak texture sampling performance. The overall architecture of the texture processors seems to have been left largely intact. Each of them still consists of 16 FP32 texture fetch units, four address units and four filters. However, with the introduction of DirectX 11 support, these processors have acquired support for 16Kx16K-pixel textures, new HDR texture compression modes, Gather instructions that accelerate the sampling of neighboring texels, etc. There is also a new anisotropic filtering algorithm that delivers the same high filtering quality irrespective of the angle of inclination of the filtered surface.
The computing cores can communicate on both local and global levels. ATI claims a considerable increase in cache bandwidth. In particular, the speed of fetching data from the L1 cache is now as high as 1 terabyte per second, while the bandwidth of the link between the L1 and L2 caches has increased to 435GBps. The L2 caches have grown from 64 to 128KB. The ratio of computing to texture-mapping resources has not changed and is still 4 to 1.
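The 4-to-1 ratio and the bandwidth numbers can be put into perspective with some back-of-the-envelope arithmetic. The sketch below uses the unit counts from the text; the 850MHz core clock is our assumption (it is not stated in this section), the helper `bytes_per_clock` is hypothetical, and the "1 terabyte per second" figure is treated as a rounded 1000GBps, so the per-clock results are approximate.

```python
# Figures quoted in the text.
SIMD_CORES = 20
PROCESSORS_PER_CORE = 16
TMUS_PER_CORE = 4
L1_BANDWIDTH_GBPS = 1000      # "1 terabyte per second", treated as a rounded value
L1_L2_BANDWIDTH_GBPS = 435

# Assumption: core clock of 850MHz (not stated in this section).
CORE_CLOCK_MHZ = 850

processors = SIMD_CORES * PROCESSORS_PER_CORE   # 320
tmus = SIMD_CORES * TMUS_PER_CORE               # 80
ratio = processors / tmus                       # 4.0


def bytes_per_clock(bandwidth_gbps, clock_mhz):
    """Convert a bandwidth in GB/s into bytes moved per core clock cycle."""
    return bandwidth_gbps * 1e9 / (clock_mhz * 1e6)


l1_l2_per_clock = bytes_per_clock(L1_L2_BANDWIDTH_GBPS, CORE_CLOCK_MHZ)
l1_per_core_per_clock = bytes_per_clock(L1_BANDWIDTH_GBPS, CORE_CLOCK_MHZ) / SIMD_CORES

print(f"Shader processors per TMU: {ratio:.0f}")
print(f"L1-L2 link: about {l1_l2_per_clock:.0f} bytes per clock")
print(f"L1 fetch:   about {l1_per_core_per_clock:.0f} bytes per clock per core")
```

Under these assumptions the L1-L2 link moves roughly 512 bytes per clock, and each core's L1 cache delivers a little under 60 bytes per clock.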
According to the developer, the peak computing power of the RV870 is as high as 2.7 teraflops in single-precision mode (FP32) and 544 gigaflops in double-precision mode (FP64), which is used for most serious computing tasks. Special mention must be made of the ability to execute threads in protected memory sections, which makes it easier to port code originally developed for a classic CPU to the GPGPU platform. All these innovations in the RV870’s computing section make it a very strong choice for GPGPU, especially in comparison with Nvidia’s solutions, whose double-precision performance is far from ideal.
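The quoted peak figures can be cross-checked with a quick calculation. The sketch below rests on two assumptions that are not stated in this section: a core clock of 850MHz, and two floating-point operations per ALU per clock (one multiply-add). The function name `peak_gflops` is ours.

```python
# Figures quoted in the text.
STREAM_PROCESSORS = 1600      # individual ALUs
QUOTED_FP64_GFLOPS = 544

# Assumptions (not stated in this section).
CORE_CLOCK_MHZ = 850          # assumed core clock
FLOPS_PER_ALU_PER_CLOCK = 2   # assumed: one multiply-add counts as two operations


def peak_gflops(alus, clock_mhz, flops_per_clock):
    """Theoretical peak throughput in gigaflops."""
    return alus * clock_mhz * flops_per_clock / 1000


fp32 = peak_gflops(STREAM_PROCESSORS, CORE_CLOCK_MHZ, FLOPS_PER_ALU_PER_CLOCK)
fp64_share = QUOTED_FP64_GFLOPS / fp32

print(f"FP32 peak: {fp32:.0f} gigaflops")               # 2720, i.e. about 2.7 teraflops
print(f"FP64 as a share of FP32: {fp64_share:.0%}")     # 20%
```

Under these assumptions the single-precision result matches the quoted 2.7 teraflops, and the quoted double-precision figure works out to one fifth of it.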