Second Generation DirectX 11 Tessellator
DirectX 11 made tessellation a standard feature, but even though the Radeon HD 5000 series complied with all the technical requirements of the new API, its tessellation performance was poor. We might even say that the Radeon HD 5000 cards supported the feature only formally. This was not a problem as long as Nvidia had no DirectX 11 solutions at all, especially as there were almost no games utilizing tessellation. The release of the Fermi architecture changed the situation: Fermi-based cards were much faster than their Radeon HD opponents at processing geometry, as we could see in the Stone Giant and Unigine Heaven benchmarks as well as in Metro 2033.
Until DirectX 11, tessellation was an interesting but nonstandard and rarely employed feature. Once it became an industry standard, AMD had to work hard to improve the tessellation unit in the new Radeon HD generation and match Nvidia’s solutions in that respect.
AMD’s tessellation technology is eight generations old already, but we can disregard the six generations before DirectX 11 as they never got much recognition from game developers. Thus, the Barts features AMD’s second-generation DirectX 11-compatible tessellation unit.
Before discussing what improvements have been implemented in the Barts in terms of tessellation, let’s take a look at the whole DirectX 11 tessellation pipeline.
Briefly, the hull shader calculates tessellation factors for each patch edge, defining the number of segments to split each edge into. The tessellator calculates the coordinates of each new vertex within the patch (UV or UVW coordinates). The domain shader then computes the final data for every vertex (position, texture coordinates, etc.) and sends it down the pipeline. The hull shader can also optionally convert the control points of a patch into another representation, for example, from a triangular patch into a square patch. These control points bypass the tessellator and are sent directly from the hull shader to the domain shader.
As you can understand from this brief description, tessellation is quite a complex process. It means that the tessellator’s ability to split primitives (patches) into multiple parts is one, but not the only, performance-limiting factor.
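The three stages described above can be illustrated with a minimal Python sketch. It is not how the hardware or the Direct3D 11 runtime implements them: the function names are hypothetical, the patch is a flat triangle, and the tessellator uses plain uniform subdivision with a single factor instead of the separate edge and inside factors of the real fixed-function unit. The only value taken from the API is the maximum tessellation factor of 64.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
MAX_TESS_FACTOR = 64  # upper limit on tessellation factors in Direct3D 11


def hull_stage(edge_lengths_px: List[float], target_px: float) -> List[int]:
    """Choose one factor per patch edge: the number of segments to split it into.

    Hypothetical heuristic: aim for segments of roughly target_px pixels.
    """
    return [max(1, min(MAX_TESS_FACTOR, math.ceil(length / target_px)))
            for length in edge_lengths_px]


def tessellator_stage(n: int) -> List[Vec3]:
    """Emit barycentric (u, v, w) domain points for a uniform factor n."""
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]


def domain_stage(corners: List[Vec3], uvw: Vec3) -> Vec3:
    """Turn one domain point into a vertex position (flat patch, no displacement)."""
    u, v, w = uvw
    a, b, c = corners
    return tuple(u * a[k] + v * b[k] + w * c[k] for k in range(3))


def tessellate(corners: List[Vec3], edge_lengths_px: List[float],
               target_px: float) -> List[Vec3]:
    factors = hull_stage(edge_lengths_px, target_px)
    n = max(factors)  # simplification: one factor for the whole patch
    return [domain_stage(corners, p) for p in tessellator_stage(n)]
```

With a uniform factor n this scheme produces (n + 1)(n + 2) / 2 vertices per patch, so the domain stage runs once per generated vertex and its workload grows roughly with the square of the factor. That is one reason why domain shader thread management matters for overall tessellation throughput.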
The new second-generation tessellation unit features a number of improvements, but not for the whole tessellation pipeline. The developers have improved thread management for domain shaders and re-sized some queues and buffers to achieve significantly higher peak throughput, particularly at lower tessellation levels. AMD warns that excessive tessellation with a polygon size below 16 pixels is harmful for performance, so we can infer that the Barts’ tessellator reaches its peak performance at that (or larger) triangle size.
AMD’s warning may be meant to explain why the Northern Islands GPUs may be inferior at very aggressive tessellation levels to the Fermi GPUs, which incorporate numerous PolyMorph geometry engines. On the other hand, excessive tessellation can indeed be harmful because each new triangle involves more color calculations, texture sampling and other operations. Modern GPUs work with 2×2-pixel tiles, so it is desirable to have polygons the size of 4, 8, 16, 32, 64 pixels and so on. As soon as polygons become smaller than 4 pixels, the GPU has to process more tiles per visible pixel and slows down catastrophically. In other words, a modern GPU can suffer a terrible performance hit when processing 1-pixel polygons, whereas the increased level of detail will hardly be noticeable during actual gameplay.
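The cost of tiny polygons can be estimated with a short Python sketch. It rests on a simplifying assumption rather than on any documented hardware behavior: every 2×2 tile touched by a triangle costs four shader invocations, even if the triangle covers only one of its pixels. The function names are hypothetical, and the coverage test samples pixel centers without the tie-breaking rules of a real rasterizer.

```python
from typing import Tuple

Point = Tuple[float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def quad_cost(tri: Tuple[Point, Point, Point],
              width: int, height: int) -> Tuple[int, int]:
    """Return (covered_pixels, shaded_pixels) for one triangle.

    shaded_pixels assumes each touched 2x2 tile is shaded in full.
    """
    a, b, c = tri
    covered = 0
    tiles = set()
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            inside = ((w0 >= 0 and w1 >= 0 and w2 >= 0) or
                      (w0 <= 0 and w1 <= 0 and w2 <= 0))
            if inside:
                covered += 1
                tiles.add((x // 2, y // 2))
    return covered, 4 * len(tiles)


# A triangle covering a single pixel versus one covering 28 pixels.
tiny = quad_cost(((2.1, 2.1), (3.4, 2.1), (2.1, 3.4)), 8, 8)
large = quad_cost(((0.0, 0.0), (7.8, 0.0), (0.0, 7.8)), 8, 8)
```

Under this model the one-pixel triangle still pays for four shaded pixels (25% efficiency), while the larger triangle covers 28 pixels at a cost of 40 (70% efficiency). The waste per visible pixel grows as triangles shrink.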
According to the official commentaries, the improvements in the Barts’ tessellator did not require many more transistors but helped make that unit twice as fast in some synthetic tasks. Like any other claim, this one needs checking in practical applications, but if the new chip’s tessellation performance is indeed so high, the Nvidia GeForce GTX 460 will only have two technical advantages over its Radeon counterpart: PhysX and CUDA.
We can expect future architectures, such as Southern Islands and Hecatonchires, to bring some innovations into the very design of the tessellation pipeline, similar to what Nvidia offers in its Fermi architecture, where each large array of stream processors has a dedicated tessellator for optimal data threading.