Directly Unified: Nvidia GeForce 8800 Architecture Review

It has long been rumored around the Web that the new generation of graphics processors from the leading developers would have an architecture completely different from anything we’ve seen on the market so far. Today we’ve got a chance to take a close look at the first representative of that new generation, the Nvidia G80 processor, and at the GeForce 8800 GTX graphics card based on that GPU.

by Alexey Stepin , Yaroslav Lyssenko, Anton Shilov
11/09/2006 | 06:34 PM

The never-ending struggle between the graphics market giants, ATI Technologies and Nvidia, continued throughout 2006. In January ATI announced its highly successful R580 GPU, an improved version of the earlier R520 chip. Notwithstanding its then-arguable architectural concept, the new GPU made the company a technological leader. Nvidia responded in March by rolling out its G71 chip, a successor to the earlier G70, which helped restore the balance. There was another exchange of blows after that.

 

In early June Nvidia strengthened its position by releasing its G71-based dual-GPU graphics card called GeForce 7950 GX2, and ATI answered in August with its Radeon X1950 XTX. Anyway, some kind of balance has existed between ATI and Nvidia up to this day. While the Radeon X1950 XTX was worse than the GeForce 7950 GX2 in terms of pure performance, it was considerably better in terms of image quality and compatibility.

As a matter of fact, Nvidia had long been improving its graphics processors by bringing merely evolutionary changes into their architecture. The G70 and G71 chips are in fact descendants of the NV40 GPU that was announced more than two years earlier, on April 14, 2004. Notwithstanding their much higher performance, the newer chips have all the typical features of their ancestor. The NV40 was undoubtedly a revolutionary solution in its own time, but the example of ATI’s Radeon X800/X850 series suggests that old technologies cannot be improved upon indefinitely.

At some moment the resources of the current architecture become depleted and not a jot more performance can be squeezed out of it. The GeForce 7950 GX2 graphics card was itself an indication of such depletion, because Nvidia could only surpass ATI Technologies’ flagship solution, based on ATI’s newer architecture, by using two G71 processors together.

So, the need for a new architecture was becoming ever more urgent, and Nvidia was of course working on it. Now that this work is over, Nvidia introduces its new GeForce 8 graphics architecture embodied in the GeForce 8800 graphics processor. It is this revolutionary architecture, which is expected to bring technological leadership back to Nvidia, that we will be talking about in this review.

DirectX 10

The Nvidia GeForce 8800 is the 3D industry’s first chip to comply with DirectX 10, also known as WGF 2.0. Some features of this new API were described in our Windows Vista preview, but we’d like to tell you more about the capabilities and advantages of DirectX 10, now that we’ve got the first DirectX 10 compatible chip in our hands.

DirectX is the most popular and handiest API for developing PC games. It is also making rapid progress in the console area: Microsoft’s Xbox consoles are built around Direct3D-class graphics hardware. The Sony PlayStation 3 doesn’t use Direct3D, but its graphics chip comes from Nvidia’s GeForce 7 family. The success of the API from the major software developer is natural: Microsoft has in fact been defining the direction of progress in gaming 3D hardware by listening to both hardware and software developers. By adding, among other things, support for very long shaders into DirectX 9 Shader Model 3.0, Microsoft provided game developers with opportunities for further growth and also set new goals for itself.

These are the goals Microsoft tried to reach with its next-generation DirectX 10 API:

Compared with the previous versions of DirectX, DirectX 10 looks very impressive. Take a look:

The GPU developers report that Microsoft has done its job well and that the new API indeed features a number of innovations over the previous version. It would take another article to review all the innovations in DirectX 10, so we’d better focus on the GeForce 8800 and its performance.

Dawn of Unified Architecture

Graphics processors have come a long way since their origin. Their evolution started with rather simple devices like the GeForce 256 that had a modest selection of fixed-function capabilities. Such chips could not even be called processors in the true sense of the word because they were unable to execute unique program code. It was the Nvidia GeForce 3 (NV20) that became the first truly programmable GPU: it could run the pixel and vertex shaders described in the DirectX 8.0 specification.

Later on, the graphics processor kept evolving in terms of programmability, so that it could execute ever more complex shader code. The GPU eventually transformed into an almost all-purpose computing device with tremendous calculating power, capable of visualizing the most sophisticated special effect the game developer’s imagination could bring forth. With some reservations, it came to resemble an ordinary CPU in terms of performance and universality: the maximum length and complexity of shader programs grew with every new version of DirectX until it became virtually unlimited in Shader Model 3.0. But all GPUs have had one fundamental limitation until now: their execution units were divided into those that ran pixel shaders and those that ran vertex shaders. So, each graphics processor had to contain two separate sets of units to process each kind of shader.

This division, although it had a number of advantages, had a negative effect on overall GPU efficiency. For example, in a pixel-shader-heavy scene the available pixel processors may lack performance whereas the computational resources of the vertex processors remain idle, or vice versa. Thus, the next step in the evolution of the GPU was evident. The described imbalance problem could only be solved by unifying the shader processors so that the overall load could be distributed among them dynamically depending on the specifics of the processed scene. The new product from Nvidia, the GeForce 8800 (G80) GPU, is the realization of that concept.
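
To illustrate the idea with a toy model (our own simplification, not Nvidia’s actual scheduling logic), the following Python sketch contrasts a fixed split of execution units with a unified pool that is allocated in proportion to the per-frame workload:

# Toy model of fixed vs. unified shader units (illustration only, not the real G80 scheduler).
def fixed_frame_time(vertex_work, pixel_work, vertex_units, pixel_units):
    # Each unit handles one work item per time step; the slower stage defines the frame time.
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def unified_frame_time(vertex_work, pixel_work, total_units):
    # A unified pool can split its units according to the actual load.
    return (vertex_work + pixel_work) / total_units

# A pixel-shader-heavy scene: 10 units of vertex work, 90 units of pixel work.
fixed = fixed_frame_time(10, 90, vertex_units=8, pixel_units=24)   # 90/24 = 3.75
unified = unified_frame_time(10, 90, total_units=32)               # 100/32 = 3.125
print(fixed, unified)   # the unified pool leaves no vertex units sitting idle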

To tell the truth, Nvidia is not a pioneer in this field. It was ATI Technologies that introduced the first graphics chip with a unified architecture. It is called Xenos (and is also known under the codename of R500). The Xenos is employed in Microsoft’s Xbox 360 console. It contains 48 unified shader processors and fully supports all of the Shader Model 3.0 capabilities (and even goes beyond them in some aspects). That chip can be regarded as a predecessor to the hero of this article.

GeForce 8800 in Detail

Execution Core

Nvidia’s GeForce 8800 sticks even closer to the unification ideology than the ATI Xenos. The heart of the new chip is a universal execution core that consists of 128 separate processors. This core works at a considerably higher clock rate than the rest of the G80’s subunits.

The stream processors are grouped into 8 blocks of 16 processors, each block being equipped with 4 texture modules and a shared L1 cache. A block consists of two shader processors (each of which consists of 8 stream processors), and all eight blocks have access to any of the six L2 caches and to any of the six arrays of general-purpose registers. Thus, data processed by one shader processor can be used by another shader processor.

Importantly, the above-described design of the shader processors, caches and general-purpose registers allows disabling shader blocks or blocks of L2 cache, general-purpose registers and 64-bit memory controller sections in case of manufacturing defects, in order to produce “cut-down” solutions to be sold at a lower price.

The data is converted into FP32 format by the Input Assembler. The Thread Processor distributes branches of code and optimizes load on the stream processors.

The GigaThread technology is an advanced analog of the Ultra-Threading technology ATI employs in its Radeon X1000 series. GigaThread allots shader blocks for processing vertex, geometry and pixel shaders depending on the overall load. Shaders of all types can be executed simultaneously if necessary and if possible. The GigaThread processor also tries to minimize the idle time of the G80’s shader blocks while texture sampling operations are being performed.

Each stream processor can perform two simultaneously issued scalar operations like MAD+MUL per cycle, and the overall computing power of the core is, according to Nvidia, about 520 gigaflops. This is over two times that of the ATI R580 whose performance, according to ATI, is about 250 gigaflops. We can make one interesting and perhaps arguable observation here. Each pixel processor in the R580 is known to have 2 scalar and 2 vector ALUs and a branch execution unit. So, it can execute up to 4 arithmetic instructions per cycle plus one branch instruction. It seems that the efficiency of one stream processor in the G80 is lower than the efficiency of one pixel processor in the R580, but the overall performance of the G80 is higher because it has more execution units (128 against 48) and clocks them at a higher frequency. Unfortunately, we don’t have any data about the design of an individual stream processor in the G80. We only know that it is fully scalar as opposed to the pixel processors of the last-generation architectures, which contain both scalar and vector ALUs.
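
A rough back-of-the-envelope check of Nvidia’s figure, assuming the dual-issued MAD+MUL counts as 3 floating-point operations per cycle per processor:

# Approximate peak shader throughput of the G80 (assumption: MAD = 2 FLOPs, MUL = 1 FLOP).
stream_processors = 128
shader_clock_ghz = 1.35
flops_per_cycle = 3            # MAD (2) + MUL (1), dual-issued
peak_gflops = stream_processors * shader_clock_ghz * flops_per_cycle
print(peak_gflops)             # ~518 GFLOPS, in line with Nvidia's "about 520 gigaflops"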

Each of the G80’s 128 stream processors is an ordinary ALU capable of processing data in floating-point format. It means that a stream processor can not only work with shaders of any type (vertex, pixel, geometry) but also process a physics model or perform other computations in the framework of the Compute Unified Device Architecture (CUDA). And it does that independently of the other processors. In other words, one part of the GeForce 8800 can be involved in some kind of computations while another part is, for example, busy visualizing the results of those computations, because the streaming architecture allows using the output of one processor as the input for another.

The GPU’s efficiency at processing shaders with dynamic branching has been improved in comparison with the ATI Radeon X1900. The latter operates on branches 48 pixels large whereas the GeForce 8800 works with branches from 16 to 32 pixels large. We will check how efficient the execution of branching pixel shaders has become in the theoretical tests section.

Lumenex Engine

The G80 graphics processor can be viewed as consisting of two parts. We have described the execution core above. The other part is called Lumenex Engine and it is responsible for sampling and filtering textures as well as for full-screen antialiasing, HDR, and the output of the rendering results to the monitor. In other words, this part of the G80 incorporates texture caches, memory access interface, TMUs and ROPs.

The flowchart of the G80 shows that the 128 stream processors are organized into 8 groups with 16 processors in each. Each group has a corresponding texture sampling and filtering unit consisting of 4 TMUs. So, the G80 contains a total of 32 TMUs, each of which is designed as follows:

Each TMU contains one sampling and two filtering units. The speed of bi-linear and 2x anisotropic filtering is 32 pixels per cycle for each filtering type. Bi-linear filtering of FP16 textures is performed at the same speed whereas the speed of FP16 2:1 anisotropic filtering is 16 pixels per cycle. The GeForce 8800 GTX’s Lumenex Engine is clocked at 575MHz, so the theoretical scene fill rate is 18.4 gigatexels per second when both bi-linear and 2x anisotropic filtering are in use.

The raster operators, also part of the Lumenex Engine, are grouped in 6 sections, each of which can process 4 pixels (with 16 subpixel samples) per cycle, which provides a total of 24 pixels per cycle with color and Z values processing. If only the Z-buffer is employed, the max number of processed pixels is 192 per cycle in normal mode and 48 per cycle with 4x multisampling.
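
A quick sanity check of the quoted throughput figures, obtained by simply multiplying the unit counts by the 575MHz clock of the Lumenex Engine:

# Theoretical texel and pixel rates of the GeForce 8800 GTX at 575MHz (illustrative arithmetic).
core_clock_mhz = 575
tmus = 32                      # 8 blocks x 4 texture modules
rops = 24                      # 6 sections x 4 pixels per cycle
texel_rate = core_clock_mhz * 1e6 * tmus / 1e9      # ~18.4 gigatexels/s with bilinear filtering
pixel_rate = core_clock_mhz * 1e6 * rops / 1e9      # ~13.8 gigapixels/s with color + Z
z_only_rate = core_clock_mhz * 1e6 * 192 / 1e9      # ~110.4 gigasamples/s in Z-only mode
print(texel_rate, pixel_rate, z_only_rate)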

The ROP subsystem supports all kinds of antialiasing: multisampling, super-sampling and transparency antialiasing. In addition to the standard selection of FSAA modes, the new GPU offers 8x, 8xQ, 16x and 16xQ modes which will be discussed below. Antialiasing of frame buffers in FP16 and FP32 formats is fully supported, so the problem of the GeForce 6 and 7 architectures, which could not use FSAA and FP HDR simultaneously, is solved in the GeForce 8.

Nvidia says the memory subsystem of the GeForce 8800 features a new controller, yet it hasn’t changed greatly since the GeForce 7 series. The number of sections has grown from 4 to 6, so the total memory bus width has grown from 256 bits (4x64) to 384 bits (6x64). Support for GDDR4 has been added, although even ordinary 900 (1800) MHz GDDR3 can provide a bandwidth of 86.4GB/s. The high frequencies of GDDR4 are not yet called for.
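
The 86.4GB/s figure is easy to verify: it is simply the effective data rate of the GDDR3 chips multiplied by the width of the 384-bit bus:

# GeForce 8800 GTX memory bandwidth from the numbers quoted above.
bus_width_bits = 6 * 64                      # six 64-bit memory controller sections
effective_rate_mhz = 1800                    # 900MHz GDDR3, double data rate
bandwidth_gbs = effective_rate_mhz * 1e6 * bus_width_bits / 8 / 1e9
print(bandwidth_gbs)                         # 86.4 GB/s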

New FSAA and Anisotropic Filtering Modes

Developing its GeForce 8800, Nvidia took care not only of its speed, but also of its image quality, an area where the company’s earlier products used to be inferior to ATI’s. There are changes and improvements in both full-screen antialiasing and anisotropic filtering. We’ll discuss the new FSAA algorithms first.

Before the GeForce 8800, the best-quality FSAA supported by Nvidia’s solutions was the hybrid 8xS mode that combined super- and multisampling. It provided superb antialiasing quality, but the super-sampling component led to a terrible performance hit, making this mode impractical. Thus, the maximum practical FSAA mode available on the GeForce 7 was ordinary 4x multisampling whereas the ATI Radeon X1000 could offer the user 6x multisampling, not as nice-looking as Nvidia’s 8xS, but much more suitable for practical purposes.

The GeForce 8800 now solves Nvidia’s problems with FSAA on single-chip graphics cards. First, 8x multisampling has been added to the list of available modes (it is called 8xQ in the ForceWare settings). Second, the new GPU supports three new antialiasing modes: 8x, 16x and 16xQ that use the so-called Coverage Sampling Antialiasing method (CSAA).

Theoretically, 16x multisampling could be used earlier, too, but it’s only with CSAA that this high-quality antialiasing method can be normally used in practice with a performance hit similar to that with ordinary 4x multisampling.

The main difference of CSAA from ordinary multisampling is that it minimizes the number of combined color/Z samples. With 16x multisampling, there are 16 full samples per each pixel on the screen. With CSAA, there are fewer color/Z samples and, accordingly, a smaller performance hit.

It should be noted that the CSAA method saves only on the number of color/Z samples, but not on the number of samples from the so-called coverage mesh. So, the share of samples that fall within the triangle the original pixel belongs to is determined much more precisely than with classic 4x MSAA. Saving on the number of color samples is justifiable because it is the color information that puts the biggest load on the graphics memory subsystem, whereas the degradation in antialiasing quality with 4 color/Z samples per pixel and 16 coverage samples will be inconspicuous. It may become conspicuous if there’s a high contrast between the antialiased polygon and the background, because 4 samples may be insufficient for an accurate averaging of the color of the resulting pixel, and the antialiasing quality will then be close to 4x MSAA.

The GeForce 8800 supports three CSAA modes in total: 8x, 16x and 16xQ. The first two modes operate with 8 and 16 coverage samples, respectively, at 4 color/Z samples. The 16xQ mode provides the highest quality and uses 8 color/depth samples, thus approaching the classic 16x MSAA in quality.
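
To give a feel for why CSAA is cheaper, here is a rough per-pixel storage estimate for the different modes. The figures are our own simplified assumption (8 bytes per stored color/Z sample, half a byte per extra coverage-only sample), not Nvidia’s actual frame buffer layout:

# Rough per-pixel frame buffer footprint for MSAA vs. CSAA modes (simplified assumption).
def per_pixel_bytes(color_z_samples, coverage_samples):
    # 8 bytes per full color/Z sample, 0.5 byte per coverage-only sample.
    return color_z_samples * 8 + (coverage_samples - color_z_samples) * 0.5

print(per_pixel_bytes(4, 4))     # 4x MSAA:   32 bytes
print(per_pixel_bytes(4, 16))    # 16x CSAA:  38 bytes - close to 4x MSAA
print(per_pixel_bytes(8, 16))    # 16xQ CSAA: 68 bytes
print(per_pixel_bytes(16, 16))   # 16x MSAA: 128 bytes - why it was never practical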

Nvidia’s new approach reduces memory usage and saves its bandwidth and thus provides a higher level of performance than with full-scale 16x multisampling. The resulting quality is close to the latter method and is much higher than what is provided by the notorious SLI AA 16x mode with its effect of blurred textures. According to Nvidia, the card’s performance is a mere 10-20% lower in 16x CSAA mode than in 4x MSAA mode. We’ll check out this claim later on.

As for anisotropic filtering, the GeForce 8800 features an algorithm whose filtering quality doesn’t depend on the angle of inclination of the texture plane. It is similar to the algorithm ATI’s cards use when you enable the High Quality AF option, but Nvidia says it provides a somewhat better filtering quality still. We’ll check this out later, too.

Nvidia GeForce 8800

Specification

With all the architectural innovations, it is not surprising that the new graphics processor from Nvidia is a very complex chip, incorporating a total of 681 million transistors, even though the TMDS transmitters and RAMDAC have been moved into a separate chip! For comparison: the number of transistors in modern desktop CPUs varies from 154 million (AMD Athlon 64 X2) to 582 million (Intel Kentsfield), most of which, as opposed to the G80, make up the L2 cache. Nvidia didn’t take risks with such a complex chip and began to manufacture it on TSMC’s time-tested 0.09-micron tech process. The company also managed to achieve high enough frequencies, making the G80 core stable at 575MHz whereas the chip’s shader processors are clocked at 1350MHz.

The GeForce 8800 family includes two models as of the time of its announcement: the $599 flagship GeForce 8800 GTX and the $449 GeForce 8800 GTS. So, even the senior model of the new series is cheaper than the GeForce 7950 GX2 whose recommended price was set at $649. As we wrote in our previous review, we can now expect the prices of the senior GeForce 7 models to drop to $300-449.

As for availability, the new graphics cards from Nvidia were expected to be available for purchase from major suppliers from the day of their announcement. However, two days before the launch date Nvidia recalled its GeForce 8800 GTX cards due to a manufacturing error the contract manufacturer had made. Nvidia says some GeForce 8800 GTX cards have an incorrect resistor which leads to visual artifacts in 3D applications. The problem of a single resistor can be easily solved on the spot by re-soldering it according to the developer’s instructions. Nvidia decided to recall at least some of the 8800 GTX cards from the sales channels, but not to postpone the release date, although the availability of the new solution is going to be lower than expected. Fortunately, this problem doesn’t concern the less expensive GeForce 8800 GTS, so every user can buy it right on the day of its announcement.

Let’s get back to the technical specifications of the GeForce 8800 series, though. We put the specs of the highest-performance single-chip graphics cards into one table for better comparison:

It’s clear that even the junior GeForce 8800 GTS surpasses the GeForce 7900 GTX as well as the Radeon X1950 XTX in almost every parameter. It should be noted, however, that the flagship product of the ATI Radeon X19xx series features as much graphics memory bandwidth as the junior Nvidia card. So, we’ve got a 320-bit bus and expensive PCB wiring in one case, and expensive GDDR4 with simpler PCB wiring for a 256-bit bus in the other case.

So, the GeForce 8800 GTX has no rivals in terms of technical characteristics, but how well is it going to perform under real-life conditions? The gaming tests will show if this debut is a success. Right now we’ll describe the design of the new graphics card for you.

Nvidia GeForce 8800 GTX

PCB Design

Implementing such a complex device as the GeForce 8800 GTX called for a new, original printed-circuit board. Such factors as the high power consumption of the G80 chip, the use of a separate chip containing TMDS transmitters and a RAMDAC, and the 384-bit memory bus all contributed to making the card very large.

 

In order to give you an understanding of how huge the new graphics card from Nvidia is, take a look at the following photograph:

As you can see, the GeForce 8800 GTX’s PCB is much longer than the Radeon X1900 XTX’s: 27.9cm against 23cm. This must be the reason why the power connectors have been moved from the reverse to the face side of the PCB so that you could plug the cables in normally in a cramped system case. The massive cooler covers most of the graphics card with its components, so we had to unfasten 11 large spring-loaded screws and 8 smaller screws to remove it. Here’s the new card in all its naked glory:

A large part of the PCB – over one third of it – is occupied by the power circuitry that has to feed the 681-million-transistor chip whose power consumption is comparable to that of today’s top-end CPUs. Nvidia has always been meticulous about powering its top-end graphics cards, and the GeForce 8800 GTX is no exception. Your computer must be equipped with a 450W or higher power supply that can yield a combined current of no less than 30A on its 12V power rail (in other words, each of the PSU’s “virtual” 12V output lines must sustain a load of at least 15A without triggering the overcurrent protection). Both PCI Express power connectors must be attached to the card. If you don’t do that, it either won’t start up, emitting a loud warning signal, or will start up at reduced frequencies, depending on which connector is attached.
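
In plain watts, that requirement works out as follows (our own arithmetic; Nvidia only publishes the current figures):

# What the PSU requirement means in watts (illustrative arithmetic).
rail_voltage = 12
total_current_a = 30                      # combined 12V current demanded by Nvidia
per_line_current_a = 15                   # per "virtual" 12V output line
print(rail_voltage * total_current_a)     # 360W of 12V capacity for the whole system
print(rail_voltage * per_line_current_a)  # 180W per line, part of which feeds one PCI Express connector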

In the bottom right corner there’s an ordinary piezo-speaker that is responsible for sending the warning signal and a 4-pin connector for the cooler’s fan. A little above them you can see a seat for a 6-pin connector which probably serves some engineering purposes.

A Primarion PX3540 PWM controller is the heart of the power circuit. It is located on the reverse side of the PCB.

The rest of the PCB, around the GPU and memory chips, looks oddly simple and unexciting except for the small FCBGA-packaged chip with an open die that is marked as NVIO-1-A3.

Nvidia didn’t make the G80 chip even more complex by integrating TMDS transmitters and a RAMDAC into it. These units used to be integrated into the GPU, but now they all reside in a separate special-purpose chip.

This helps avoid interference in the RAMDAC that could be caused by the shader processors working at 1.35GHz, and improves the chip yield of the new GPU. This solution makes the wiring of the new PCB somewhat more sophisticated, though. The external NVIO chip may also serve some other purposes besides just outputting the image. By the way, the NVIO is a kind of return to the roots because graphics cards once used to employ external RAMDACs.

History likes to tell the same old story anew: the last time we saw a GPU package with a metal heat-spreader, it was the Nvidia NV35 chip. The company abandoned it with the release of the NV40, limiting itself to a protective frame around the die. The high power consumption and the non-uniform heat dissipation of the different parts of the new GPU made Nvidia return to the heat-spreader idea. As a result, the G80 looks not unlike the modern CPUs from Intel and AMD. This design ensures better heat transfer and minimizes the risk of damage to the fragile die. The cooler of the GeForce 8800 being rather heavy, there is a metal frame around the GPU, fastened right to the PCB with 8 small screws, to avoid putting too heavy a weight on the chip as well as to distribute the weight of the cooler uniformly on the PCB.

The GPU marking doesn’t tell us its codename or official name. There is only the updated Nvidia logotype there. The marking shows the date of manufacture and the chip revision. Here it is the 37th week of the current year, i.e. between the 11th and 17th of September. At that moment Nvidia already had fully functional revision A2 samples of the G80, i.e. the third revision of the chip. The number 507 is written in the top left corner with a blue marker. Perhaps it is a unique number of the chip or graphics card.

The clock rate of the GeForce 8800 GTX GPU is 575MHz, which is an achievement for a 0.09-micron die consisting of 681 million transistors. This is not a limit, though. The shader processors in the senior GeForce 8800 are working at 1350MHz. The G80 architecture resembles Intel’s NetBurst in this respect: the ALUs work at two times the main frequency there.

The circular placement of memory chips, with some of them installed at an angle of 45 degrees, has been changed for a simpler layout: the 12 GDDR3 chips are positioned around the GPU in three straight rows, two vertical and one horizontal. Each chip is organized as 16Mx32. They are accessed across a 384-bit bus. The junior GeForce 8800 GTS carries only 10 chips on board, which narrows its memory bus to 320 bits.

These Samsung K4J52324QE-BJ1A chips are a RoHS version of the widespread K4J52324QC series. The manufacturer’s website doesn’t describe them yet, but they seem to be the same 2.0V chips capable of working at frequencies up to 900 (1800) MHz. This is indeed the frequency the chips are clocked at on the GeForce 8800 GTX card. The use of a 384-bit memory bus helped Nvidia achieve an impressive memory bandwidth of 86.4GB/s without installing the rare and expensive GDDR4.
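
The chip organization also determines the card’s total memory amount and bus width, which can be checked with the same simple arithmetic:

# Memory configuration of the GeForce 8800 GTX derived from the chip organization.
chips = 12
chip_capacity_mbit = 16 * 32             # 16M x 32 = 512 Mbit per chip
total_mb = chips * chip_capacity_mbit / 8  # 768 MB on the GTX
bus_width_bits = chips * 32                # 384 bits (10 chips on the GTS give 320 bits and 640 MB)
print(total_mb, bus_width_bits)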

There are two MIO connectors in the top left corner of the PCB. This may mean different things: 1) bi-directional data transfers, 2) increased bandwidth by using both the MIO interfaces integrated into the GPU at the same time, 3) an opportunity to use four GeForce 8800 GTX cards in a Quad SLI configuration. Although it’s hard to imagine such a monster in a home PC – 8 power connectors and a power consumption of over 500 watts – the latter thing seems plausible. This is also confirmed by the fact that the junior GeForce 8800 GTS has only one MIO connector and Nvidia wouldn’t have reduced the efficiency of a SLI tandem you can build out of two such cards just for the sake of paltry economy. Quad SLI is quite a different thing. It is a premium technology that is targeted at a narrow group of users who want to have the highest performance possible whatever its price may be. Looking for more speed, such users will buy the GeForce 8800 GTX rather than the cheaper GeForce 8800 GTS, for which the option of working in a Quad SLI subsystem is virtually unnecessary.

The configuration of the external connectors is standard: two DVI-I ports with support for dual-link and HDCP, and a universal 7-pin S-Video/YPbPr port. The card lacks VIVO functionality, but that’s not at all important today.

Cooling System

Now let’s have a look at the cooler installed on the GeForce 8800 GTX card. Obviously, it is expected to dissipate the same amount of heat as today’s CPU coolers do, yet remain rather compact. As opposed to a CPU cooler, a graphics cooler has little room for growth as it is limited by the dimensions of the graphics card. How did Nvidia solve that riddle?

It has already become clear that the cooling system of a modern graphics card must exhaust hot air out of the system case. An extra 120-150 watts of heat inside the system case would make the thermal conditions there unbearable, especially if the system features a top-end CPU. ATI Technologies has been using such coolers since the Radeon X850, but Nvidia has only partially followed this concept until now (if you don’t count the notorious cooling system of the GeForce FX 5800 Ultra). As you know from our reviews, the cooler of the GeForce 7900 GTX exhausts only a portion of hot air out of the case whereas the cooler of the GeForce 7950 GX2 doesn’t do even that due to the dual-PCB design of that graphics card. Developing the new cooling system for the GeForce 8800 series, Nvidia tried to address the older flaws and get rid of them where possible. Here’s what they have come up with in the end:

The new cooler resembles the device that was installed on the GeForce 6800 Ultra, but it is larger and turned around by 180 degrees so that hot air is exhausted through the slits in the graphics card’s mounting bracket rather than into the system case. Of course, the cooler is more sophisticated than the one installed on the GeForce 6800 Ultra because the G80 generates more heat than the NV40. To our surprise, Nvidia didn’t use a copper heatsink as ATI did on its cards. You can see through the slits in the casing that the heatsink, as before, consists of thin aluminum plates. We removed the cooler’s casing to see the following picture:

We can see the same component layout as in the ATI Radeon X1950 XTX cooler: the heat generated by the GPU is transferred to the massive copper base and is evenly distributed in the heatsink by means of a heat pipe, which greatly facilitates the transfer of heat.

The base, the heatsink and the fan are installed on a light aluminum frame that has protrusions opposite to the memory chips, to the chip containing TMDS transmitters and RAMDAC, and to the switching MOS transistors in the power circuit. In every case, there are traditional thermal pads made from some non-organic fiber and soaked in white thermal grease that serve as the thermal interface. There’s a layer of dark-gray thick thermal paste between the copper sole and the GPU cap. The frame has rectangular slits near the fan which helps improve the cooling of the power elements and the PCB by taking air in through those slits.

The heatsink is cooled with a blower whose airflow is directed perpendicular to the axis of its blades. The static pressure of the stream of air is higher than with the classic axial fans of the same wattage. A blower is the optimal choice for this cooling system design as it can effectively blow through the long densely-ribbed heatsink with high aerodynamic resistance.

The fan employed by Nvidia draws a current of 0.48A at 12V, which works out to about 5.8W. At its highest speed the fan must be unbearably loud, but we hope the fan speed management system of the GeForce 8800 GTX will do its job well.

The cooling system of the Nvidia GeForce 8800 GTX card seems to be a logical, complete solution that is quite capable of cooling such a powerful chip as the G80. The use of aluminum instead of copper in the main heatsink is somewhat alarming – this may require the fan to rotate at a high speed and, accordingly, to produce more noise. We’ll check this out in the next section.

Noise and Power Consumption

We measured the level of noise produced by the graphics cards’ coolers with a Velleman DVM1326 digital sound-level meter (0.1dB resolution) using A-weighting. At the time of our tests the level of ambient noise in our lab was 36dBA, and the level of noise at a distance of 1 meter from a working testbed with a passively cooled graphics card inside was 40dBA. We got the following results:

Contrary to our apprehensions, Nvidia’s new cooler proves to be rather quiet. The GeForce 8800 GTX is comparable in this parameter to the GeForce 7900 GTX, which is one of the quietest graphics cards available. The new card is quiet in every operation mode while the fan speed management system only makes itself heard for the first few seconds after you start the PC up. That’s an impressive achievement considering that the GeForce 8800 GTX’s cooler must dissipate over 130W of heat. We applaud Nvidia’s engineers who have managed to replicate the superb noise characteristics of the GeForce 7900 GTX cooler while keeping within the much harder thermal design requirements of the GeForce 8800 GTX.

Unfortunately, we couldn’t measure the power consumption of the GeForce 8800 GTX because our testbed with a modified Intel Desktop Board D925XCV turned out to be incompatible with Nvidia’s new card. The system started up, initialized successfully and began to boot the OS, but the graphics card didn’t output any signal to the monitor. So, we have to quote Nvidia’s numbers here.

According to Nvidia, the GeForce 8800 GTX doesn’t show anything extraordinary in terms of power consumption. Its power draw of 145.5W under maximum load is quite an expected value for a 0.09-micron chip consisting of 681 million transistors, many of which are clocked at 1.35GHz. There’s no reason to worry. As we’ve found out, the cooling system employed by Nvidia copes with that load quite successfully and with little noise.

Overclocking

The specification of the GeForce 8800 GTX – its complex PCB, the tech process, the clock rates of the G80 chip and the number of transistors in it – couldn’t raise any hopes for good overclocking. We thought the overclockability of such a complex graphics card must be near zero.

However, reality defied our theoretical constructions: we increased the main GPU frequency from 575 to 625MHz and the card remained stable at that setting for a long time. We don’t know what the frequency of the shader processors was at that point, or whether it changed at all, because it may be fixed at 1350MHz. If this frequency indeed grew proportionally, it must have been about 1467MHz. This roughly 9% frequency growth should have affected the heat dissipation of the GPU.
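
If the shader domain indeed scales with the main clock, the resulting frequency is easy to estimate (an assumption on our part, since the driver doesn’t report the shader clock):

# Estimated shader clock after overclocking, assuming it scales with the core clock.
stock_core, stock_shader = 575, 1350
oc_core = 625
oc_shader = stock_shader * oc_core / stock_core
print(round(oc_shader))    # ~1467 MHz, roughly a 9% increase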

The memory frequency was increased by 50MHz, too, from 900 (1800) to 950 (1900) MHz. Considering the 384-bit memory access bus that made the wiring of the PCB more complex and that we exceeded the rating frequency of the memory chips, this is a good enough result. Anyway, the GeForce 8800 GTX can be overclocked without extreme methods like volt-modding and without replacing its native cooler with a higher-performance water- or cryogen-based one.

Now it’s time to see the new graphics architecture in action.

Testbed and Methods

We benchmarked the GeForce 8800 GTX card on platforms with the following components:

The drivers were set up in such a way as to deliver the best quality of texture filtering.

ATI Catalyst:

Nvidia ForceWare:

Since we’re dealing with a completely new graphics architecture, we will first carry out a short theoretical investigation before running real-life games on our GeForce 8800 GTX. This will help us identify the weak and strong points of the GeForce 8800 architecture. We use the following programs for that:

FSAA and Anisotropic Filtering Quality

Before running our theoretical tests on the GeForce 8800 GTX, we want to check out its FSAA and anisotropic filtering quality.

The mechanism of control over the FSAA modes has been improved in the new version of ForceWare. The Antialiasing Mode selection window now offers the Enhance the application setting option. It is meant for those applications that support FSAA and allow turning it on from their own menus, but only offer a standard selection of modes (2x, 4x, and, occasionally, 8x). If the Enhance the application setting option is not enabled, choosing the 8x mode in the game will turn on 8x MSAA rather than 8x CSAA. But if that option is enabled, the driver will detect that FSAA is turned on in the game’s settings and will replace 2x/4x FSAA with the CSAA mode selected in the Antialiasing – Setting list: 8x, 16x or 16xQ.

This mechanism is meant to provide better compatibility and stability in those games that allow using FSAA but are limited to the standard FSAA modes. But if the game doesn’t offer FSAA settings at all, you should use the Override any application setting option. In some cases, however, forcing FSAA may lead to visual artifacts or instability in the game. All the screenshots below were taken in the Enhance the application setting mode. We checked them with ATI’s TheCompressonator and verified that they are per-pixel identical to the screenshots we got in the Override any application setting mode.

Of course, we were mostly interested in the new CSAA algorithm, but we also paid attention to the new 8xQ mode that uses the pure MSAA method as opposed to the 8xS mode. So, let’s be methodical:

MSAA 4x/8x vs. CSAA 8x

Half-Life 2


MSAA 4x


MSAA 8x


CSAA 8x

Elder Scrolls: Oblivion


MSAA 4x


MSAA 8x


CSAA 8x

The difference between classic 8x MSAA and 8x CSAA, which uses 4 color/Z samples per pixel, is small and barely noticeable with the naked eye. We examined the screenshots with the ATI TheCompressonator utility and indeed found a few differences. They are most conspicuous where we had anticipated them, i.e. in high-contrast areas like the top left corner of the Half-Life 2 screenshot where a part of the wire fence is located against a brightly lit wall. It can be seen under magnification that the 8xQ mode ensures a higher accuracy of calculation of the color of the resulting pixel, so the antialiasing looks more uniform throughout the fence.

8x CSAA doesn’t provide any great advantages over 4x MSAA. The difference is mainly in the better quality of antialiasing of micro-geometry (ropes, armature, etc).

CSAA 8x vs. Supersampling and SLI AA 8x

Half-Life 2


CSAA 8x


Supersampling


SLI AA 8x

Elder Scrolls: Oblivion


CSAA 8x


Supersampling


SLI AA 8x

When we compare 8x CSAA with the hybrid 8xS mode, the overall smoothness of the scene with the latter method is striking. This is the effect of super-sampling. Super-sampling initially operates with a larger number of texture samples, so not only the edges of polygons, but also the wire fence from Half-Life 2 gets anti-aliased, although the wire mesh doesn’t consist of polygons but is a semitransparent texture. Both modes provide a rather high quality of antialiasing, but you should keep in mind that 8xS blurs textures somewhat, which may reduce the quality of small details in the scene, and may also prove too resource-consuming to be used for real gaming.

The 8x SLI AA mode is comparable to 8x CSAA in quality, although it also tries to smooth out semitransparent textures. In any case, it is only available on two graphics cards united in a SLI tandem and is of no interest for owners of a single graphics card. Fortunately, Nvidia’s innovations leave users a choice they didn’t have with GeForce 7 series cards.

MSAA 4x/8x vs. CSAA 16x/16xQ

Half-Life 2


MSAA 4x


MSAA 8x


CSAA 16x


CSAA 16xQ

Elder Scrolls: Oblivion


MSAA 4x


MSAA 8x


CSAA 16x


CSAA 16xQ

The main and loudly touted feature of CSAA is that it needs fewer resources than classic multisampling. Nvidia claims the performance in 16x CSAA mode is going to be just a little lower than with 4x MSAA. We’ll talk about the speed factor below. Right now let’s discuss the quality factor.

16x CSAA ensures higher-precision antialiasing in comparison with 4x MSAA, especially on small scene details, because it uses 4 times the number of coverage samples. It keeps the same color information as 4x MSAA, though, so the smoothed-out edges of polygons may not look as ideal as they would with 16x MSAA.

Such a high level of multisampling would be too heavy even for the GeForce 8800 GTX, but we can compare 16x CSAA with 8x MSAA. The difference is smaller here and can hardly be revealed even with TheCompressonator: 8x MSAA calculates the final pixel color with more accuracy than 16x CSAA does and, unlike 4x MSAA, doesn’t suffer much from errors in determining whether a pixel belongs to the smoothed-out polygon. Formally, then, 8x MSAA is better than 16x CSAA in terms of image quality, but we shouldn’t forget about speed.

It’s even harder to see the difference between 16x CSAA and 16xQ CSAA which stores more information about the color and depth (8 samples instead of 4), but 16xQ CSAA is surely the highest-quality antialiasing mode available today on a single graphics card.

CSAA 16x/16xQ vs. SLI AA 16x

Half-Life 2


CSAA 16x


CSAA 16xQ


SLI AA 16x

Elder Scrolls: Oblivion


CSAA 16x


CSAA 16xQ


SLI AA 16x

It’s similar to the above-described situation with 8x CSAA, 8xS FSAA and 8x SLI AA, but with more conspicuous symptoms. The 16x SLI AA mode performs antialiasing on transparent surfaces like foliage or the wire fence, but it also makes the scene somewhat blurry. Some small details may be lost, like the relief on the masonry in Oblivion. Besides that, 16x SLI AA requires two graphics cards whereas the 16xQ CSAA mode provides higher sharpness and works on a single GeForce 8800, so it is indeed the highest-quality antialiasing method available on single graphics cards.

Besides comparing the quality of the new full-screen antialiasing modes implemented in the GeForce 8800, we also decided to check out the quality of anisotropic filtering provided by ATI’s and Nvidia’s flagship products. GeForce 7 series cards are excluded from this test as they do not comply with the high image quality standards. For the Radeon X1950 XTX we enabled the High Quality AF mode. Here are the results:

Anisotropic Filtering

G80

R580


Quality


High Quality


High Quality

The Nvidia GeForce 8800 always uses the new anisotropic filtering algorithm, irrespective of enabled/disabled optimizations and of the filtering quality mode, High Quality or Quality. In the latter case the transitions between the mip-levels degrade and the filtering quality isn’t high, yet it doesn’t depend much on the angle of inclination of the plane of the filtered texture.

In the High Quality mode the quality of anisotropic filtering provided by the GeForce 8800 GTX is nearly ideal and free from any defects. It is even higher than that of the Radeon X1950 XTX, the recognized ex-leader in this area. The diagram with the colored mip-levels shows the characteristic surges that are indicative of the not entirely accurate method of anisotropic filtering employed by the ATI card – there are certain “inconvenient” angles with this method. The new algorithm from Nvidia is free from this defect. Note also that ATI’s algorithm is somewhat more aggressive, as is visible on the textures closest to the viewer, which are sharper than on the GeForce 8800. This may result in the effect of “flickering” textures in games.

Performance in FSAA Modes

Besides comparing the quality of the different FSAA modes supported by the GeForce 8800, we also compared their influence on the speed of the card in the popular 3D shooter Half-Life 2: Episode One that runs on the advanced Source engine. The results are listed below:

It’s clear that the simpler 2x and 4x FSAA modes are virtually free on such a powerful solution as the GeForce 8800 GTX, as opposed to the ex-flagship GeForce 7950 GX2. The latter suffers a performance hit in high resolutions in spite of its total of 48 TMUs and 32 ROPs.

It’s more interesting with the high-quality antialiasing modes. The 8xS mode is useless even in 1600x1200 whereas the only mode where the performance of the GeForce 8800 GTX declines with the growth of the resolution is CSAA 16xQ. The results of the card in the 8x CSAA, 8xQ MSAA and 16x CSAA modes are identical. It means you can use any of them in Half-Life 2: Episode One, but the 8xQ MSAA mode is going to be the optimal choice if you want to have the highest antialiasing quality without losing speed. As we’ve found out already, this mode is optimal in terms of coverage accuracy and the precision with which the color of the final pixel is calculated.

The CSAA 16xQ mode can also be used in practice, but it is heavy even for the GeForce 8800 GTX at a resolution of 1920x1200 pixels. Nvidia didn’t implement 16x MSAA and quite rightly so, because it would provide a negligible quality improvement over 16xQ CSAA while the performance hit would be too big for practical purposes.

After discussing FSAA and anisotropic filtering issues we can now proceed to theoretical tests.

Performance in Theoretical Tests

Scene Fill Rate

The GeForce 8800 GTX behaves in a predictable way in this test. In traditional terms, the chip has 32 TMUs and 24 ROPs, and this helps it beat the last-generation solutions. As for working with the Z-buffer, the G80 GPU is known to be able to process twice the number of Z values per cycle with 4x multisampling enabled, and up to 192 Z values per cycle when FSAA is disabled. We don’t use FSAA in our theoretical tests, so we’re dealing with the second operation mode here.

So, it’s all right with the fill rate of the GeForce 8800 GTX card. It is about 33% faster than the GeForce 7900 GTX even in the hardest case, which is to be expected considering the difference in the number of their TMUs and ROPs and the increased memory bandwidth of the newer solution.
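
That 33% figure matches the theoretical ratio of the two cards’ pixel fill rates, assuming the GeForce 7900 GTX’s commonly quoted configuration of 16 ROPs at a 650MHz core clock:

# Theoretical pixel fill rate ratio, GeForce 8800 GTX vs. GeForce 7900 GTX
# (7900 GTX figures are the commonly quoted 16 ROPs at 650MHz).
g80_rate = 24 * 575          # 13800 megapixels/s
g71_rate = 16 * 650          # 10400 megapixels/s
print(g80_rate / g71_rate)   # ~1.33, i.e. about 33% faster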

Pixel Shader Performance

The new card’s behavior when executing pixel shaders differs somewhat from that of the cards built on last-generation GPUs. Although the first considerable performance hit occurs when the card goes from the simple version 2.0 pixel shader to the more complex PS 2.0 Longer, the card then performs even faster with the 4 Registers PS 2.0 shader. It’s only with the per-pixel lighting shader that we see one more big performance hit.

You can note that in the last case the GeForce 8800 GTX is only two times faster than the Radeon X1950 XTX despite its 128 stream processors clocked at 1.35GHz. It seems that the performance of the stream processors is limited by other factors, perhaps by the performance of the TMUs.

There’s an increase in performance in almost every subtest in Xbitmark. The only exception is the Plaid Fabric shader whose main feature is sampling from a 3D texture. In this case, the three graphics cards running in this test deliver similar results, which implies some limitation, perhaps on the software level, i.e. in the GeForce 8800 support of the new version of the ForceWare driver.

The G80 incorporates special-purpose branch execution units similar to those in the ATI X1000 architecture, so the new chip easily crunches through shaders with dynamic branching and is even more efficient at that than its opponent.

So, the superb pixel shader processing potential of the new Nvidia card is evident, but it’s too early to make any conclusions yet. Let’s see what we have in other tests.

The GeForce 8800 GTX is about 30% faster than the Radeon X1950 XTX in the pixel shader test from 3DMark05. But considering the similar behavior of so different graphics cards, we suspect that it is the graphics memory bandwidth that is the bottleneck here.

The pixel shader test from 3DMark06 uses a similar shader, but produces different results. The GeForce 8800 GTX enjoys a bigger advantage over the Radeon X1950 XTX in this test, especially in the resolution of 1280x1024 pixels. Still, we think that it is not the computing power, but the speed of texture sampling, the memory controllers and/or the caches that is the main performance-limiting factor here.

So, the GeForce 8800 GTX shows its very best in the pixel shader tests. When high mathematical performance is needed, the 128 stream processors clocked at 1.35GHz are unrivalled. And if the pixel shader contains a lot of texture lookups, the card provides a performance growth, too, thanks to its increased memory bandwidth and the 32 texture address units. These two factors will surely affect the card’s performance in real games.

Now let’s see how well the unified architecture is going to execute vertex shaders.

Vertex Shader Performance

When working with “pure” geometry, the GeForce 8800 GTX is for some unclear reason slower than the last-generation solutions with their dedicated vertex processors, but the unified architecture shows its best as soon as there appear light sources in the scene. The GeForce 8800 GTX never slows down much here even when it has to process as many as eight light sources.

So, this is a striking example of the superiority of a unified architecture over an architecture with independent blocks of pixel and vertex processors: 128 stream processors show an exceptional speed against 8 special-purpose ones (even though containing several ALUs in each).

When it comes to rendering several highly polygonal models in a scene with one light source, the GeForce 8800 GTX behaves exactly as it did in Xbitmark without any light sources, being a little slower than the Radeon X1950 XTX.

It’s different in the test that renders vegetation in which every blade of grass is rendered independently with the help of a vertex shader. This scene is closer to real game scenes than the previous one and the GeForce 8800 GTX feels at ease here, speeding up suddenly. It enjoys a 65% advantage over the GeForce 7900 GTX.

The analogous test from 3DMark06 produces different results. It’s the Radeon X1950 XTX that is the outsider here since its vertex processors work at a lower frequency than the GeForce 7900 GTX’s: 650MHz against 700MHz. The GeForce 8800 GTX shows almost the same result as the ex-flagship of Nvidia’s single-chip graphics cards.

The results of the Complex Vertex Shader test coincide with the numbers from the same-name 3DMark05 test to within 0.1-0.2 points.

Other Theoretical Tests

This test helps check out the efficiency of processing the physical model of a system of particles by means of pixel shaders and of visualizing it with vertex texturing. The ATI Radeon X1950 XTX, which doesn’t support texture sampling in vertex shaders, drops out of this test while the new architecture from Nvidia is as fast as the old one.

The unified-architecture GeForce 8800 GTX suits the task of sampling textures in vertex shaders ideally, so it is no wonder it routs its predecessor, delivering from 7.5 to 8 times its performance.

In the Perlin Noise test the realistic clouds that change in real time are generated by means of a pixel shader that contains 447 mathematical instructions and 48 texture lookups. So, this test shows how well the GPU can work with long shaders (the Shader Model 3.0 specification requires support for shaders 512 instructions long).

As in the majority of the previous theoretical tests, the GeForce 8800 GTX shows itself a truly new-generation graphics card, enjoying a big advantage over the ex-leader Radeon X1950 XTX. Here, it profits greatly from its high shader processor frequency as well as from its ability to perform 32 texture lookups per cycle whereas the ATI card can perform only 16.

Conclusion

We can’t make any conclusions about the GeForce 8800 GTX as a gaming card until we’ve finished our gaming tests, but the new architecture from Nvidia boasts an impressive potential, beyond any doubt. This is obvious even from the theoretical tests where the GeForce 8800 GTX is the absolute winner. Moreover, we suspect that our tests cannot fully show the advantages of the new graphics card from Nvidia over its last-generation opponents.

On the one hand, we often see the performance of the GeForce 8800 GTX being limited by the throughput of its TMUs or onboard memory, which conceals the huge difference between the GeForce 8800 GTX and the Radeon X1950 XTX in raw computing power. On the other hand, a unified shader architecture is a priori better off in synthetic benchmarks because the whole computing power of the GPU is allotted to a single task.

Nvidia also took care of the image quality provided by its new flagship: the new FSAA and anisotropic filtering algorithms are indeed an improvement over the previous products. This is most important for the GeForce series that used to lag behind its opponents from the Radeon X1000 series in terms of image quality.

That said, the true potential of the GeForce 8 will only be revealed after the release of Windows Vista and DirectX 10 which promise us a lot of improvements in future games in terms of speed as well as quality.

The triumph came to Nvidia at a price: the 0.09-micron G80 is a very large and uneconomical chip with high heat dissipation. Combined with the wider, 384-bit memory bus, this made the PCB of the new card very large and expensive, too. There were some problems as a consequence. As you may know, the senior GeForce 8800 model was withdrawn from sales channels not long before the announcement due to technical issues that made the first series of the GeForce 8800 GTX inoperable. Fortunately, that problem doesn’t concern the GeForce 8800 GTS, yet the delay in the shipments of the world’s fastest graphics card is no good for Nvidia. We hope this problem will be solved in the near future and the GeForce 8800 GTX will be available for everyone who wants it. Nvidia says that the GeForce 8800 GTX, even though in limited quantities, will indeed be available right after its announcement.

We’ll voice our final verdict about the GeForce 8800 GTX and its market perspectives in our upcoming article dedicated to the gaming performance of the new card. It will be published soon on our site.