NVIDIA GeForce 6800 Ultra and GeForce 6800: NV40 Enters the Scene

One of the most highly anticipated graphics processors from NVIDIA Corp. has finally seen the light of day. The chip not only has everything the GeForce FX boasted, it also delivers cutting-edge performance well ahead of the previous leaders. Make sure to check out our in-depth analysis of the GeForce 6 architecture and be inspired by performance figures from 22 game benchmarks!

by Tim Tscheblockov, Alexey Stepin, Anton Shilov
04/14/2004 | 01:48 AM

You may remember that about a year ago NVIDIA Corporation suffered the greatest fiasco in its history. The company’s new NV30 graphics processor was supposed to shatter ATI Technologies, which was dominating the field of DirectX 9-compatible solutions, but eventually failed to do so. The new architecture developed in NVIDIA’s labs went far beyond the requirements of the DirectX 9 standard, but was clumsy with its huge number of transistors. Its first embodiment, the GeForce FX 5800 Ultra, was a failure. The peculiarities of the GPU architecture, together with the introduction of high-speed GDDR2 memory on a narrow 128-bit memory bus, made the newcomer from NVIDIA even slower than ATI’s RADEON 9700 PRO in real applications.

 

NVIDIA corrected the mistake quickly by releasing an improved version, the NV35, a much more viable solution. Anyway, the NV30 left a gloomy impression: the company stopped producing such chips after issuing just a few tens of thousands of them and removed all mentions of the NV30 from their website.

A year passed and NVIDIA released a few successful products, replacing the NV35 with the NV38 core. Besides that, they abandoned the Detonator driver, having switched to the ForceWare suite with its special shader-code compiler introduced for higher performance in modern 3D games. Anyway, graphics cards on chips from ATI Technologies like RADEON 9800 PRO and RADEON 9800 XT have remained the best choice for a PC enthusiast.

Of course, NVIDIA never gave up plans to take revenge on ATI. You certainly heard rumors about the NV40 core long before its actual implementation in silicon. They were talking about 16 rendering pipelines, pixel and vertex shaders version 3.0 support and various other innovations. The rumors were wide-reaching. Some well-informed sources stated that the new chip would work with GDDR3 memory clocked at 1600MHz, while the GPU clock rate would be up to 600MHz...

This situation resembles the one we had about a year ago: the upcoming product from NVIDIA looks impressive in preliminary specifications, rumors and press releases. NV30 also looked that impressive in its time. There’s again much talk about an imminent revolution in the 3D desktop graphics, just like in 2003.

This time, however, NVIDIA is more reserved in its promises. The company also decided to discard the suffix “FX” in the names of the upcoming products – they will be called just GeForce 6800 Ultra and GeForce 6800. I wonder if it is some kind of superstition and they don’t want the NV40 to repeat the fate of the NV30 (which had that “FX” suffix). On the other hand, they may just want to distinguish the new product series from the older one or change the brand completely.

Now the time of rumors is over: the new wonder chip from NVIDIA has officially seen the light of day, and we can talk about it with both the official specs and the graphics card itself on our hands!

Closer Look

One feature of the GeForce 6800 Ultra catches the eye immediately: there are two power connectors. We saw something like that in the Volari Duo V8 Ultra and had expected to see it here: 220 million transistors and memory working at 1100MHz call for an appropriate power supply. Another fact really left us flabbergasted, though – NVIDIA recommends a 480W or higher Power Supply Unit (PSU) to ensure stable operation of the new GeForces! You don’t often find such a PSU even in hardcore overclockers’ systems; they definitely don’t come in mainstream system cases. Thus, the potential owner of a GeForce 6800 Ultra has to shell out about $100-150 for a proper PSU like those from Thermaltake – I don’t think this will add popularity to the new GPUs. Fortunately, our testbed had a 550W PSU inside, so we had no power-related problems running our tests.

Quiet Cooler for a Powerful Card

The card itself looks less imposing than its predecessors (GeForce FX 5950 Ultra and GeForce FX 5900 Ultra). The major reason for that is NVIDIA’s decision to use a smaller, more compact cooling system based on heat pipes. To be precise, the system occupies one slot and a half rather than two slots, but it still takes less space than previous cooling systems from NVIDIA. In any case, the PCI slot located right next to the AGP slot will be occupied by the cooling system.

A centrifugal blower is still used to create the air flow. Considering the number of transistors (220 million) and the frequencies (starting at 400MHz), we have little hope for noiseless cooling, notwithstanding the heat pipes. The fan’s rotation speed cannot be monitored: it is connected to the PCB with only two wires, which signifies the absence of a tachometer.

The cooler design deserves a separate mention in our review, since it is pretty outstanding, I should say. The heat is taken off the die via a thin-rib aluminum heatsink attached to the chip with a thin layer of high-quality thermal paste. In our case the heatsink is monolithic, i.e. the ribs and the base are a solid piece of metal, which ensures higher heat dissipation efficiency. The fan hidden inside a plastic housing blows the air through this heatsink and then through the memory cooling system, which is also designed in a very interesting way. The memory chips are covered with an aluminum plate of an unusual shape, which dissipates the heat. However, since the chips located at the upper edge of the PCB do not get cooled by the air stream generated by the fan, their heat is moved along a flat heat pipe within reach of the air flow. The heat pipe there is equipped with an additional thin-rib heatsink section, which serves to dissipate the heat generated by the GDDR3 chips. The thermal interface between the memory chips and the heatsink is the same as in the case of the GeForce FX 5950 Ultra: special fibrous pads soaked with white thermal paste. They proved to be very efficient.

The GDDR3 memory chips are made by Samsung and are rated for 600MHz (1200MHz DDR) with a 1.6ns access time. However, on our GeForce 6800 Ultra sample they worked at a slightly lower frequency of 550MHz (1100MHz DDR). Despite much higher working frequencies, GDDR3 memory generates considerably less heat than GDDR2: the memory heatsink of our GeForce 6800 Ultra was just a little bit warm at work, just like the reverse side of the PCB opposite these chips. High-end graphics adapters have finally got rid of the eternal GDDR2 curse: tremendous heat generation. Since GDDR3 is a much more convenient memory type than GDDR2, its share in contemporary graphics accelerators will only grow.

I was surprised to find out that the new cooling system is rather quiet. To be more exact, it generates quite a bit of noise when you start your computer or reboot it. After a little while, as the temperature remains low enough, the fan speed management system calms down and reduces the fan rotation speed, which makes the noise much lower, though you can still hear the fan. When you start 3D applications the fan doesn’t speed up to the utmost of its power like we saw with the GeForce FX 5950 Ultra, so playing games with the new GeForce 6800 Ultra promises to be quite a pleasant occupation. I assume the fan speeds up only if the core temperature grows up to a certain threshold.

Welcome, the Shiny NV40!

The GPU die will undoubtedly thrill you with delight when you see it: it is huge! This is the price they had to pay for the tremendous number of transistors (220 million). The GPU die surface contacts the heatsink directly, because, unlike the NV35 and NV38, it has no protective lid on it. They probably did it to improve heat dissipation.

The GPU has a mirror-like surface and the edges are protected with a special frame which prevents the heatsink from being mounted askew. This is not a new solution, as ATI has already used it for the RADEON 9700 PRO. Our particular GPU sample was manufactured in week 12 of 2004, i.e. about 4 weeks ago, so we can call our GeForce 6800 Ultra a young boy :) Actually, our GPU worked at a relatively low core clock: when we activated CoolBits in the drivers, the core was reported to be working at 400MHz. This is lower than the core frequency of the NV38, but we shouldn’t consider a high working frequency a must for high performance, as the NV40 architecture is much more advanced than that of the previous-generation solution, the NV38.

Simplified PCB Design

The PCB has changed mostly in the right part where the power circuitry is. The device consumes much more power than its predecessors, so some elements of the circuitry are covered with a small passive heatsink. To our surprise, there are pretty few high-capacity capacitors – only four of them here, while the GeForce FX 5950 Ultra has five. The PCB does carry quite a few SMD capacitors, though, so I wouldn’t call the power circuitry of the new GeForce 6800 Ultra simple, especially keeping in mind the PSU requirements.

Anyway, the PCB design of GeForce 6800 Ultra is much simpler than that of GeForce FX 5950 Ultra or any other high-end solution from NVIDIA.

A small buzzer is evidently part of the protection system and informs the user about overheating, fan failure, lack of power and so on. Tyan took a similar approach in its graphics cards, and we have already seen how useful it can sometimes be. The two power connectors are firmly fastened with a metal loop soldered to the PCB; they are marked as Primary and Secondary.

  

There are two DVI-I connectors at the left part of the PCB and the TV-out is placed unusually. This solution makes perfect sense considering the growing popularity of LCD monitors. If you’ve got a monitor with an analog input, you can attach it via a simple adapter included with the card.

The back side of the PCB doesn’t carry additional memory chips, as GDDR3 allows amassing the necessary 256MB with eight chips only. I would like to draw your attention to the memory wiring: it is pretty complex as the memory works at 550 (1100) MHz. Actually, the front side of the PCB may seem somewhat empty to you, because locating the memory chips on one side allowed moving a lot of other components to the bottom of the PCB. In particular, the additional TMDS transmitter from Silicon Image as well as a whole lot of other smaller components have been moved to the reverse side of the PCB.

The metal plate at the back serves to fasten the cooling system and to protect the PCB from damage. The cooler is attached with spring-loaded screws so that you don’t crush the board by tightening them too much.

The place where the connectors reside is interesting, too. As you see, each landing place has two connectors of DVI-I or D-Sub type. This means that the card may come in the following versions:

The last version will surely be the most popular of all, but the first variant may be appealing for some users, too. Overall, the PCB and the cooling system of the new card are not a revelation. Although the PCB differs from that of GeForce FX 5950 Ultra, there are some similarities, too. As for the cooling system, NVIDIA has recently been drawn to snail-shaped fans, while heat pipes are a logical solution considering the heat dissipation of the NV40.

The PCB design seems pretty smart, save for the two power connectors, which should both be used. That is not the most convenient solution, as you will need one extra PSU connector, which you don’t always have available (that depends on your system configuration). On the other hand, you’ll probably have to buy a new powerful PSU for your new NV40-based graphics card anyway, and such PSUs usually have enough Molex connectors. By the way, NVIDIA doesn’t recommend using power splitters.

500W PSU: How Much Power Do You Need?

Again, NVIDIA recommends that the new graphics cards be used with 480W and higher PSUs. This may be an overstatement, as the card itself consumes about 150W, but it does require stable and reliable power, especially under peak loads. Then, other system components should get their deserved power supply, too.

Today, there are many inexpensive 400/450W PSUs from obscure manufacturers who don’t pay much attention to the quality of their products, particularly to the stability of output voltages and currents. Installed into a system case with such a PSU, a GeForce 6800 Ultra graphics card may not get sufficient supply current. This may be the reason for NVIDIA to stress the necessity of a 480W PSU – such units will surely provide the necessary currents on the +5V and +12V rails for the GeForce 6800 Ultra and the rest of the system components.

If you’ve got a high-quality PSU, you may only run into trouble in two cases: during GPU overclocking and when your system is stuffed with too many power-hungry components.

We once undertook a few extreme overclocking experiments with the GeForce FX 5950 Ultra, increasing the GPU and graphics memory voltages. NVIDIA advises using 350W PSUs for such cards, but at some point we found the device unstable. Then we took a high-quality 550W PSU, thus increasing the stability of the supply current, and managed to raise the clock rates even higher.

Probably, NV40-based graphics cards will behave like that, too. What’s curious, 400MHz is the lowest frequency for the GeForce 6800 chip according to unofficial information. Any overclocking experiments or the oncoming release of the faster chip modifications will cause inevitable increase in heat dissipation and power consumption, while the stability requirements will remain the same. I wonder if NVIDIA will still require a 480W PSU or will push the bar higher, to 550W or 600W supplies…

The need to install the NV40-based card into a system with a good PSU also results from the fact that other devices in this system will probably match the card – something like a Pentium 4 Extreme Edition or Prescott, a pair of hard disk drives, a couple of optical drives, a top-end sound card, a wireless networking adapter, a bunch of USB devices, various additional controllers and so on. With all this stuff onboard, you shouldn’t be surprised to see your computer being quite fastidious about the PSU. Thus, NVIDIA’s recommendation to buy a 500W PSU seems reasonable enough for your ultra-fast gaming station to run smoothly.

This is the harsh reality: modern electronic chips eat more power and produce more heat. Top-end CPUs and GPUs of the new generation won’t certainly consume less power than today. The PC enthusiast must have a powerful and high-quality PSU.

Specifications, Features and Peculiarities of NVIDIA GeForce 6800/6800 Ultra

NVIDIA’s last-generation graphics processors, based on the NV30/NV35 architecture, differed from ATI’s R300/R350 chips in higher flexibility, but the extra functionality of their pixel processors came at the price of pretty slow shader execution. This remains a sore spot of the CineFX/CineFX 2.0 architecture and GeForce FX chips in their competition with RADEON 9xxx graphics processors.

When developing the next-generation processor, NVIDIA couldn’t help relying on the previous experience. Of course, the company couldn’t stand a situation where solutions from its main rival were faster in nearly every price sector, in spite of NVIDIA’s formal technological superiority. That’s why, besides further functionality enhancements and the introduction of version 3.0 pixel shader support, they tried to give a performance boost to the NV40 and to reinforce the weak sides of the CineFX architecture.

Here are the basic characteristics of the NVIDIA GeForce 6800 Ultra GPU in comparison with top-end graphics processors of the previous generation (NVIDIA GeForce FX 5950 Ultra and ATI RADEON 9800 XT).

| | NVIDIA GeForce 6800 Ultra | NVIDIA GeForce FX 5950 Ultra | ATI RADEON 9800 XT |
| Process technology | 0.13 micron | 0.13 micron | 0.15 micron |
| Transistor count | 220 million | 125 million | 110 million |
| GPU clock speed | 400MHz | 475MHz | 412MHz |
| Memory controller | 256-bit DDR/GDDR2/GDDR3 | 256-bit DDR SDRAM | 256-bit DDR SDRAM |
| Memory clock speed | 1100 (550) MHz | 950 (475 DDR) MHz | 730 (365 DDR) MHz |
| Peak theoretical memory bandwidth | 35.2GB/s | 28.3GB/s | 21.8GB/s |
| Maximum memory size | 512MB? | 256MB | 256MB |
| AGP interface | AGP 3.0 4x/8x | AGP 3.0 4x/8x | AGP 3.0 4x/8x |

Pixel processors, Pixel Shaders
| Pixel shader version | 3.0 | 2.x | 2.0 |
| Max number of pixels per clock | 16 | 4 | 8 |
| Max number of Z-values per clock | 32 | 8 | 8 |
| Number of TMUs | 16 | 8 | 8 |
| Static loops and branching | yes | no | no |
| Dynamic loops and branching | yes | no | no |
| Max number of textures per shader | 16 | 16 | 16 |
| Max number of texture instructions | n/a | 1024 | 32 |
| Max number of arithmetic instructions | n/a | 1024 | 64 (+64) |
| Max number of instructions per shader | n/a | 1024 | 96 (+64) |
| Registers | n/a | 2 color registers, 512 (1024) constant registers, 8 texture coordinate registers, 16 TMU identification registers, 16 (32) temporary registers, 4 resulting color registers, 1 resulting Z register | 2 color registers, 32 constant registers, 8 texture coordinate registers, 16 TMU identification registers, 12 temporary registers, 4 resulting color registers, 1 resulting Z register |
| Data representation formats | fixed point?, 16-bit float, 32-bit float | fixed point, 16-bit float, 32-bit float | fixed point, 16-bit float, 32-bit float |
| Internal pixel shader pipeline precision | 128-bit pixel precision, 32-bit and 16-bit float precision | 128-bit pixel precision, 32-bit and 16-bit float precision | 96-bit pixel precision, 24-bit float precision |
| Multiple Render Targets | yes | no | yes |
| FP Render Target | yes | no | yes |
| Texture filtering | bilinear, trilinear, anisotropic, trilinear + anisotropic | bilinear, trilinear, anisotropic, trilinear + anisotropic | bilinear, trilinear, anisotropic, trilinear + anisotropic |
| Max level of anisotropic filtering | 16x | 8x | 16x |

Vertex processors, vertex shaders
| Vertex shader version | 3.0 | 2.x | 2.0 |
| Number of vertex processors | 6 | 3 | 4 |
| Static loops and branching | yes | yes | yes |
| Dynamic loops and branching | yes | yes | no |
| Max number of instructions per shader | n/a | 256 | 256 |
| Max number of instructions with loop extensions | 65536? | 65536 | 65536 |
| Registers | 16 input registers, 16 temporary registers, 256 constant floating-point registers, 256 constant integer registers, 256 Boolean registers, 1 address register, 1 loop counter register, 8 output registers for texture coordinates, 1 fog color output register, 1 vertex position output register, 1 point size output register, 2 output registers for the diffuse/specular color components | 16 input registers, 12 temporary registers, 256 constant floating-point registers, 16 constant integer registers, 16 Boolean registers, 1 address register, 1 loop counter register, 8 output registers for texture coordinates, 1 fog color output register, 1 vertex position output register, 1 point size output register, 2 output registers for the diffuse/specular color components |
| Data representation formats | 32-bit floating point | 32-bit floating point | 32-bit floating point |
| Texture reads from the vertex shader | yes | no | no |
| Tessellation | no | no | no |

Full Scene AntiAliasing
| FSAA methods | supersampling, multisampling, rotated-grid multisampling? | supersampling, multisampling, ordered-grid supersampling, ordered-grid multisampling | rotated-grid multisampling |
| Number of samples | 2..8 | 2..8 | 2, 4, 6 |

Technologies aimed at higher memory bandwidth efficiency
| Hidden Surface Removal (HSR) | yes | yes | yes |
| Frame buffer, Z-buffer, texture compression | yes | yes | yes |
| Fast Z-clear | yes | yes | yes |

The list of parameters and features is impressive: this is a new-generation solution from any point of view. The pixel pipeline can output up to 16 pixels per clock cycle; there are 6 vertex processors – any competitor should beware of that power. The CineFX 3.0 architecture supports shaders version 3.0, a new full-screen anti-aliasing method, an improved anisotropic filtering algorithm, and UltraShadow II technology for faster processing of shadows in next-generation games like Doom 3. The High-Precision Dynamic-Range technology allows building scenes with a high dynamic lighting range.

Now let’s take a closer look at the basic characteristics of the NVIDIA NV40.

GeForce 6800/6800 Ultra Rendering Pipeline

You can’t talk about traditional pixel shaders with regard to the NV40 core and the CineFX 3.0 engine. Starting from the NV30 and CineFX, graphics processors from NVIDIA have one wide pixel pipeline with several pixels at its different stages at a time, rather than several independent pixel pipelines. The NV40 inherits the NV3x architecture, but with major improvements and additions.

The NV40 has a four times wider pixel pipeline than the NV35 had. Now there can be as many as 16 pixels in the pipeline simultaneously and the maximum output rate is 16 pixels per clock cycle. When more than one texture is mapped, the pixel output rate is lower. For example, 8 pixels per clock cycle when dealing with two textures. The pixel pipeline of the NV40, like that of the NV35, accelerates when working with the Z-buffer or the stencil buffer. The overall four-fold performance increase is felt here, too. Now the graphics processor can output 32 Z-values per clock cycle. Thus, the NV40 is simply boiling with raw power – we can speak about a quadruple fill-rate increase compared to the NV35.
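
For reference, here is a quick back-of-the-envelope check of those fill-rate figures, using the 400MHz core clock reported by our sample (not necessarily the clock of the final retail product) and the per-clock numbers quoted above:

```python
# Rough peak fill rates implied by the figures above: 16 pixels or 32 Z values
# per clock at the 400MHz core clock of our sample.
core_clock_hz = 400e6

pixel_fill_rate = 16 * core_clock_hz   # color pixels per second
z_fill_rate = 32 * core_clock_hz       # Z/stencil values per second

print(pixel_fill_rate / 1e6, "Mpixels/s")   # 6400.0
print(z_fill_rate / 1e6, "MZ-values/s")     # 12800.0
```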

Expanding the pixel pipeline, NVIDIA also improved the computational capability of the pixel processor. Firstly, the number of available temporary registers – the weak side of pixel processors in GeForce FX chips because of the peculiarities of the pixel pipeline architecture – has been increased. Complex pixel shaders calculated with full 32-bit precision shouldn’t now be an unbearable load for the GPU.

Secondly, the number of fully-operational ALUs for processing pixel components seems to have been doubled in the NV40. To be precise, the two types of floating-point ALUs of the NV35 – “fully-operational” and “simplified”, which had replaced the integer ALUs of the NV30 – have both turned into “fully-operational” ALUs that perform operations of any complexity at the same speed. This is how NVIDIA shows the advantages of CineFX 3.0 (right) with the double number of ALUs over “traditional” architectures (left):

Thus, ALUs of the NV40 can perform up to eight operations on the components of one pixel in a clock cycle. Considering that the NV40 pipeline processes 16 pixels at a time, the NV40 core has 32 floating-point ALUs that perform up to 128 operations per clock cycle.
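
The same arithmetic can be written out explicitly. The sketch below simply multiplies the figures claimed in the previous paragraph together; these are NVIDIA's per-clock numbers, not something we measured directly:

```python
# Shader arithmetic throughput implied by the figures above: 16 pixels in flight,
# 2 fully-operational ALUs per pixel, 4 components processed per ALU operation.
pixels_per_clock = 16
alus_per_pixel = 2
components_per_alu = 4
core_clock_hz = 400e6

ops_per_clock = pixels_per_clock * alus_per_pixel * components_per_alu
print(ops_per_clock)                                   # 128 component operations per clock
print(ops_per_clock * core_clock_hz / 1e9, "G ops/s")  # ~51.2 billion component operations per second
```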

Overall, NV40 seems to be an evolutionary development of the NV3x series, with considerable improvements. Besides “quantitative” improvements, there are “qualitative” ones, too.

Pixel Shaders Version 3.0

The support of pixel shader version 3.0 implemented in the NV40 first of all implies support for dynamic loops and branching. Now the decision about which branch should be executed is taken during the actual execution of the shader – the variables that determine the flow of the shader may vary, rather than being constants as was the case with static branches and loops.

Clearly, the new functionality of the NV40 won’t show itself when running version 2.0 shaders, where only the speed characteristics of the pixel processor matter (hopefully, the processor is fast enough, so that we don’t see the slow shader processing of the NV3x again).

There is one problem that can arise with version 3.0 shaders that use dynamic loops and branches. Processing several pixels, the NV40 may encounter a situation when it must execute one branch of the shader for some pixels and another branch for other pixels. How does the pipeline work in this case?

Possible solutions always hit performance. For example, if the pipeline meets a branch and starts processing pixels one by one, rather than several pixels at a time, the execution units will mostly be idle. Conversely, if both branches of the shader are executed for all pixels, additional computational expenses arise.
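
The dilemma can be illustrated with a toy model. The sketch below compares the two strategies on an invented batch of pixels with made-up per-branch costs; it is only meant to show why neither option is free, not how the NV40 actually schedules its pipeline:

```python
# Two naive ways of handling a divergent branch in a 16-pixel-wide pipeline.
def serialize(pixels, cost_a, cost_b):
    # Process pixels one by one: no wasted math, but the wide pipeline runs nearly empty.
    return sum(cost_a if takes_a else cost_b for takes_a in pixels)

def run_both_branches(pixels, cost_a, cost_b, width=16):
    # Execute both branches for every 16-pixel batch and mask out the unused results.
    batches = (len(pixels) + width - 1) // width
    return batches * (cost_a + cost_b)

pixels = [i % 3 == 0 for i in range(160)]   # a mixed batch: every third pixel takes branch A
print(serialize(pixels, cost_a=4, cost_b=10))          # 1276 "cycles"
print(run_both_branches(pixels, cost_a=4, cost_b=10))  # 140 "cycles"
```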

We tried to estimate the performance of the NVIDIA GeForce 6800/6800 Ultra when executing pixel shaders version 3.0 using our own benchmark; however, we did not succeed for some reason. The Microsoft DirectX 9 API has supported pixel shaders 3.0 since its first release, but the GeForce 6800 Ultra did not run the shader we offered it. We believe the Shader 3.0 functionality is at least partially disabled in the current driver.

NVIDIA HPDR: Lighting Gets More Realistic

Previous-generation graphics processors from NVIDIA didn’t support output from the pixel shader to several buffers simultaneously (Multiple Render Targets) or rendering into a buffer in floating-point representation (FP Render Target). ATI’s graphics chip family supported these features from the very beginning, which gave it an advantage over NVIDIA’s solutions.

The NV40 has finally acquired full support of Multiple Render Targets and FP Render Target, which allowed the company’s marketing people to introduce a new term: NVIDIA HPDR. This abbreviation stands for High-Precision Dynamic-Range, i.e. the ability to build a scene with a high dynamic lighting range (HDRI, High Dynamic Range Images).

The major idea of HDRI is very simple: the lighting parameters (color and intensity) of the pixels forming the image should be described with real physical quantities. To see what this actually means, you should recall today’s approach to describing images.

RGB Model and Our Eyes

Today’s universal image description model is the additive, hardware-dependent RGB (Red, Green, Blue) model, which was originally developed for display devices such as the CRT (Cathode Ray Tube), i.e. the regular computer monitor. According to this model, any color can be represented as a sum of three basic colors – red, green and blue – with properly selected intensities. The intensity of each basic color is split into 256 shades (intensity gradations).

The number 256 was selected quite arbitrarily and appeared as a compromise between computer graphics subsystem performance, photorealistic image requirements and the binary nature of all computer calculations. In particular, it was found that 16.7 million shades (256x256x256) are more than enough for images of photographic quality. Moreover, 256 can be easily coded in the binary system as 2^8, i.e. 1 byte.

So, according to the RGB model, black color looks like (0, 0, 0), i.e. there is no intensity at all, while white color looks like (255, 255, 255), which is the maximum intensity possible for all three basic colors.

Of course, any color in RGB model will be described with an integer triad. Note that floating point numbers (such as 1.6 or 25.4, for instance) cannot be used within this model, and the numbers used are kind of “fake”, i.e. they have nothing to do with real physical lighting parameters.

One more interesting feature of the 8-bit intensity representation is its discrete character. The maximum screen brightness of contemporary monitors is known to be around 100-120cd/m^2. If we split this value into 256 shades, we will get about 0.47cd/m^2, which is the brightness interval between the two nearest shades. This way, the monitor brightness is discrete, and this sampling step (which we can also call a threshold of sensitivity to brightness gradients) equals 0.47cd/m^2 if we set the monitor brightness to the maximum, and around 0.4cd/m^2 if the brightness is set to 70-80%.
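
The step size is easy to recompute for any assumed peak brightness; the sketch below just repeats the division from the paragraph above:

```python
# Brightness step implied by 8-bit intensity encoding, for two assumed peak-brightness values.
for peak_cd_m2 in (100.0, 120.0):
    step = peak_cd_m2 / 256            # difference between two adjacent shades
    print(f"{peak_cd_m2:.0f} cd/m^2 peak -> {step:.2f} cd/m^2 per step")
# 100 cd/m^2 -> 0.39 cd/m^2 per step
# 120 cd/m^2 -> 0.47 cd/m^2 per step
```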

On the other hand, the dynamic range of the human eye lies between 10^-6 and 10^8 cd/m^2, i.e. it makes 100,000,000,000,000:1, or 14 orders of magnitude. The human eye cannot see light from this entire range at the same time, though: the maximum intensity range visible to the eye at any one moment is around 10,000:1. And since human eyesight tracks light intensity and color separately, the entire gamut your eye can perceive makes 10,000 brightness shades x 10,000 color shades, which equals 10^8 colors.

Another important peculiarity of human vision is the threshold of sensitivity, i.e. the minimal change of lighting intensity perceivable by the human eye (brightness resolution). The value of this threshold depends on the light intensity and grows as the latter increases. From 0.01 to 100cd/m^2 the ratio of the threshold to the intensity is constant (Weber’s law) and equals about 0.02, i.e. 2%. In other words, the threshold of sensitivity at a light intensity of 1cd/m^2 makes 0.02cd/m^2, at 10 – 0.2cd/m^2, at 50 – 1cd/m^2, and at 100 – 2cd/m^2. The remaining part of the intensity range doesn’t follow this rule, and the dependence in that case is described by a more complicated law.
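
Weber’s law also gives a rough way to count how many brightness levels the eye can actually distinguish over that range, which shows why 256 evenly spaced steps are a poor match for human vision. The 2% figure below is the constant from the paragraph above:

```python
import math

# Just-noticeable brightness difference under Weber's law (about 2% of the current intensity).
weber_fraction = 0.02
for intensity in (1, 10, 50, 100):                       # cd/m^2
    print(intensity, "cd/m^2 ->", weber_fraction * intensity, "cd/m^2 threshold")

# Number of distinguishable levels between 0.01 and 100 cd/m^2 if each step is 2%:
levels = math.log(100 / 0.01) / math.log(1 + weber_fraction)
print(round(levels), "distinguishable levels")           # ~465, and they are not evenly spaced
```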

Of course, the dynamic range of the monitor (and of the RGB model description) is not enough to represent all real-world images, or even that part of them which the human eye can perceive. The typical consequence is the clipping of all intensities in the upper and lower parts of the range. An example here could be a room with an open window on a sunny summer day. The monitor will correctly display either the room interior or the part of the outdoor scene you can see through the window, but not both.

HDR Comes to Replace RGB

Where is the way out then?

As far as the computer monitor is concerned, there is hardly anything you can do about it: you cannot increase the screen brightness up to the level of Sun brightness.

But if there is nothing we can do about the monitor, then why don’t we give up the RGB model, especially since it can be done absolutely painlessly? Let’s describe the images with real physical values of light intensity and color, and let the monitor display all it can, as it will hardly be worse anyway. :) This is exactly the idea behind HDRI: for the pixels of the image we set the intensity and color in real physical values or values linearly proportional to them. Of course, all real (and fake) lighting parameters are now described with real numbers and not integers, so 8 bits per channel will no longer suffice. This approach immediately eliminates all limitations imposed by the RGB model: the dynamic range of the image is theoretically not limited at all. This way the question of discreteness and the number of brightness gradations is no longer acute, and the problem of insufficient color coverage is also solved.

We could say that the introduction of HDRI for the first time separated the description of an image – its numeric representation within the HDRI model – from the presentation of that description on a particular display device, such as a PC monitor, inkjet or photo printer. Image description and image display thus turned into two independent processes, while the HDRI description became hardware-independent.

Displaying an HDR image on the monitor or printing it out requires transforming the dynamic range and color range of the HDRI into the dynamic and color range of the output device: RGB for monitors, CMYK for printers, CIE Lab, Kodak CYY and the like. Since all these models are LDRI (Low Dynamic Range Images), this transformation cannot be performed painlessly. The process is known as tone mapping, and it exploits the peculiarities of the human eye to reduce the losses during this transformation. Since there is no mathematical model describing human eyesight and its mechanisms fully and correctly, there is no general tone mapping algorithm which could always ensure a quality outcome.
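
To make the term concrete, here is a minimal sketch of one simple global tone-mapping operator (the textbook x/(1+x) curve). It is only an illustration of the idea, not the operator the NV40 or any particular game uses:

```python
# Compress HDR radiance values (arbitrary physical-like units) into 8-bit LDR values.
def tone_map(radiance, exposure=1.0):
    x = radiance * exposure
    compressed = x / (1.0 + x)      # maps [0, infinity) smoothly into [0, 1)
    return round(255 * compressed)

for value in (0.01, 0.5, 1.0, 10.0, 1000.0):   # from deep shadow to bright sunlight
    print(value, "->", tone_map(value))
# Very bright values are compressed heavily, while dark ones keep most of their gradations.
```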

Let’s return to the numeric representation of the HDRI description. Infinite dynamic range is a good thing, but a computer cannot process infinity. That is why in practice the dynamic range is usually limited from the top and from the bottom. A good approximation of this limitation is the range of the human eye, i.e. from 10^-6 to 10^8. So we get a dilemma here. On the one hand, the broader the dynamic range, the better. On the other hand, we should spare computer resources, because a bigger range requires more data to describe the image. To solve this problem, several formats of numeric HDR image representation were developed, which differ in the available range and the resulting data size.

NVIDIA NV40 Acquires HDR

NVIDIA uses a compromise variant, the 16-bit OpenEXR format developed by Industrial Light and Magic. The 16-bit OpenEXR representation devotes one bit to the sign, five bits to the exponent and ten bits to the mantissa of each color component. The dynamic range thus stretches over about 9 orders of magnitude: from 6.14·10^-5 to about 6.55·10^4.
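
Assuming this layout matches the usual IEEE-style half-precision format (which is what OpenEXR’s 16-bit type is), the range is easy to verify with NumPy:

```python
import numpy as np

# 16-bit "half" float: 1 sign bit, 5 exponent bits, 10 mantissa bits per component.
half = np.finfo(np.float16)
print(half.tiny)                                     # smallest normal value, ~6.10e-05
print(half.max)                                      # largest value, 65504.0
print(np.log10(float(half.max) / float(half.tiny)))  # ~9 orders of magnitude
```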

The process of constructing and outputting a HDR image with the NV40 graphics processor is divided into three steps:

  1. Light Transport: rendering a scene with a high lighting dynamic range and saving the information about the light characteristics of each pixel in a buffer that uses the OpenEXR floating-point data format. NVIDIA stresses the fact that the NV40 supports floating-point data representation on each step of creation of a HDR scene, ensuring the minimum quality loss:
  2. Tone Mapping – translation of the image with a high dynamic range into a LDRI format (RGBA or sRGB).
  3. Color and Gamma Correction – translation of the image into the color space of the display device (CRT or an LCD monitor or anything else).

So, the NV40 with its HPDR technology makes high-dynamic-range images available to admirers of NVIDIA products, not only to owners of RADEONs. This is another step towards bringing photorealistic graphics into computer games.

Vertex Pipelines, Vertex Shaders 3.0

While reinforcing the NV40’s pixel processor, NVIDIA didn’t forget about the “geometrical force” of the new GeForce. The new graphics chips have twice as many vertex pipelines – six against three in the GeForce FX 5950 Ultra. New games feature more complex models and huge numbers of polygons per scene, so the doubled vertex-processing performance of the new GPU will surely be put to good use.

In the test section of the review we will try to estimate how fast the new GeForce is in games that demand a high speed of processing geometry data.

The functionality of the NV40’s vertex processors grew along with their performance. NVIDIA claims the new GPU fully supports version 3.0 vertex shaders. Their length is now practically infinite, like that of pixel shaders (well, it is limited by the DirectX shader model 3.0 specification), and a shader can now have truly dynamic loops and branches. The choice of which code is executed for a particular vertex is now made during the execution of the shader, rather than during compilation.

Vertex Frequency Stream Divider is an interesting feature of the NV40’s vertex processors. Using this frequency divider, the vertex processors can read data from streams and update the input parameters of the vertex shader not for each processed vertex, as before, but less frequently, with an adjustable frequency.

NVIDIA offers the following example of using this feature: it’s possible to read streaming data (for example animation data) at a definite frequency and use the same model-geometry data, describing a soldier, to create a whole army of such soldiers that wouldn’t be clones of each other, but would each have a unique appearance and animation.
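
The idea of the frequency divider can be sketched in a few lines: one stream holds the shared soldier mesh, another holds per-instance data, and the divider makes the second stream advance once per soldier instead of once per vertex. The data below is invented purely for illustration, and this is not real vertex-stream API code:

```python
# Toy model of a vertex stream frequency divider.
mesh_vertices = ["v0", "v1", "v2", "v3"]                         # shared soldier geometry
per_instance = ["red uniform", "blue uniform", "green uniform"]  # one entry per soldier

verts_per_instance = len(mesh_vertices)   # the second stream advances at 1/4 the vertex rate

for i in range(len(mesh_vertices) * len(per_instance)):
    vertex = mesh_vertices[i % verts_per_instance]
    instance = per_instance[i // verts_per_instance]   # updated only once per soldier
    print(vertex, instance)
```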

Last but not least, the vertex processors of the NV40 can read and use values extracted from textures during shader execution. The extracted texture sample can be employed for geometry deformation, for instance. This process is known as displacement mapping. Such mapping allows for a higher level of realism and interactivity: one or several static or updated displacement maps can be used to create an illusion of a changing, “real” water surface.
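
A bare-bones sketch of the idea: each vertex fetches a height from a map and is pushed along its normal by that amount. The height map and the vertex here are made up; the point is only to show where the texture read fits into vertex processing:

```python
import math

# Displace a vertex along its normal by a height value "fetched" from a texture.
def displace(position, normal, height_map, u, v):
    h = height_map[v][u]                     # the texture fetch inside the vertex shader
    return tuple(p + n * h for p, n in zip(position, normal))

height_map = [[0.1 * math.sin(x + y) for x in range(4)] for y in range(4)]  # water-like ripples
flat_vertex = (1.0, 0.0, 2.0)
up_normal = (0.0, 1.0, 0.0)
print(displace(flat_vertex, up_normal, height_map, u=2, v=1))
```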

There’s only one thing I’m sad about: having “trained” the NV40 to work with textures, NVIDIA engineers didn’t introduce support for the automatic creation of new vertexes. For example, the vertex processors of Matrox’s Parhelia chip can create new vertexes on their own, based on the parameters of the existing ones, thus making the geometrical description of models more precise. The GPU divides the original triangles that describe the model into smaller ones (this process is known as tessellation). The tessellation degree may depend on the distance from the viewer, while the division algorithm of the Parhelia doesn’t produce any sudden changes in the model’s appearance. This is called adaptive tessellation. You can learn more about displacement mapping and adaptive tessellation in our Matrox Parhelia review.

Clearly, the use of adaptive tessellation with displacement maps can help the graphics processor render more realistic scenes. Regrettably, the new generation of graphics processors from NVIDIA doesn’t support hardware tessellation.

UltraShadow II

First announced in the NV35, the UltraShadow technology, upgraded to the second version, appears in the NV40, too. The point of the technology has remained the same: when processing dynamic shadows with the help of the stencil buffer, it’s possible to set border Z values (“depth bounds”) beyond which the shadows from light sources won’t be rendered. Thus, this technology makes possible some computational economy and performance increase in scenes that use real-time shadows rendering.

NVIDIA illustrates UltraShadow II with the following figure, which shows the boundary values (zmin and zmax), beyond which the stencil buffer is not processed:

The ability to indicate boundary conditions for rendering shadows, coupled with the known capability of the NV40 to speed up when processing the stencil buffer and the Z-buffer (that is, to output 32, rather than 16 values per clock cycle), may bring the NV40 a significant advantage over its competitors in games that render dynamic shadows with the help of the stencil buffer.
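
In pseudo-code form the depth-bounds idea is very simple: pixels whose stored depth lies outside [zmin, zmax] cannot be touched by the light, so the expensive stencil work is skipped for them. The depth values and bounds below are invented for illustration:

```python
# Toy model of a depth-bounds test in a stencil shadow pass.
def shadow_pass(depth_buffer, zmin, zmax):
    processed = 0
    for z in depth_buffer:
        if zmin <= z <= zmax:      # only these pixels can lie inside the light's volume
            processed += 1         # ...so only they get the stencil update
    return processed

depths = [0.1, 0.3, 0.45, 0.5, 0.7, 0.95]
print(shadow_pass(depths, zmin=0.4, zmax=0.6), "of", len(depths), "pixels processed")  # 2 of 6
```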

The advantages of UltraShadow will only be apparent when game developers use this technology explicitly and identify those depth bounds. As for the ability to output many Z values per clock cycle, the NV40 always has this option on, in any game.

Doom 3 is an example of the game that uses dynamic shadows. Moreover, its gameplay is based on shadows. The gaming community has long been anticipating this potential hit, so NVIDIA’s putting its stake on Doom 3 is a good marketing move, at least.

Programmable Video-processor

The home PC has long been advertised as a universal entertainment center, so developers don’t have to divide processors for personal computers into categories based on their typical application. The entire series of chips is usually endowed with the most copious bundle of features.

The NV40 has a programmable video-processor intended for encoding/decoding video streams and performing various operations on them. You may have heard S3 talking about such things with regard to their DeltaChrome chip. ATI’s RADEON 9500/9600/9700/9800 chips can decode video using their pixel processors, too. The video-processor of the NV40 features:

In fact, this video processor is a full-fledged processing device capable of performing scalar and vector calculations as well as executing branches.

The video processor doesn’t seem to be an independent functional unit. It is more likely that the pixel processor of the NV40 bears the load of processing video. This is our supposition, though, as NVIDIA haven’t yet clarified the situation.

As for the efficiency of the NV40’s video processor, NVIDIA claims it takes over 60% of the workload when encoding video into the MPEG-2 format and up to 95% when decoding MPEG-2.

That’s not the only application of the video processor. Its programmable architecture makes it possible to lay special effects onto the picture in real time. Until now, only S3 Graphics with its DeltaChrome could boast this capability.

We think the most useful and interesting functions of the video processor in the NV40 core include hardware MPEG-2/4 encoding, MPEG-4 and WMV9 decoding, and support of HDTV. HDTV is the most resource-hungry format today: modern CPUs and graphics cards have no problems decoding DVD movies (MPEG-2), while decoding an HDTV stream may be difficult even for a very fast CPU. Hardware support of MPEG-4 (DivX and XviD) looks promising, too, since this format is widespread today. Again, NVIDIA claims its video processor to be flexible and programmable. In other words, the company can increase the number of supported formats in the future by simply revising the drivers.

Like in our S3 DeltaChrome review, we decided to try the NV40 at decoding MPEG-2/4 and HDTV – we don’t like to rely on the manufacturer’s words only. For the latter format, we used the same video clips as with the DeltaChrome, written in the HDTV WMV format. Alas, our testing ended without really starting. When trying to play any video with BSPlayer, we saw a blank gray window, although with sound. It turned out that the NV40 with the current version of its drivers didn’t support the Overlay Mixer mode. The problem disappeared when we switched to the VMR-9 mode, only to be followed by another. When playing an HDTV WMV video clip with Windows Media Player 9, we had the central processor loaded to the full 100%, and the video played back jerkily, although the image quality was all right.

The CPU workload was far above NVIDIA’s claims during MPEG-4 playback, too. While a 640x480 video clip encoded with the DivX codec was playing, the central processor was loaded by 60-65%, which is too much for such a video file. For comparison, the RADEON 9800 XT played the same clip, loading the system by only 35-40%.

DVD playback was all right, save for the same too-high CPU load: about 30% in both WinDVD and in Windows Media Player.

These results make us think that the current version of the ForceWare driver has the support of NVIDIA’s video processor disabled. The developers may have encountered problems with this unit and put it off for a while to avoid stability issues. This is unlikely to be a hardware problem, since video processing seems to be performed by the ordinary ALUs and if the ALUs were defective, the entire NV40 core wouldn’t function properly. One way or another, the promised video-stream-processing capabilities of the NV40 remain on paper so far.

Anyway, we hope that this is a temporary situation and the new ForceWare version will come with the video processor support. If it does, we’ll surely test it and evaluate it as it deserves.

Full-Screen Antialiasing: the New Quality

Rotated-Grid Multisampling in Theory

ATI’s R3x0 GPUs and NVIDIA’s GeForce FX chips take different approaches to full-screen antialiasing (FSAA). The GeForce FX supports multisampling, supersampling and their combinations, placing sub-pixels – as the tradition demands – on an ordered orthogonal grid, while graphics processors from ATI perform multisampling, placing sub-pixels on a rotated grid.

The rotated-grid FSAA method means a higher quality of smoothing the edges of polygons at the same computational cost.

The next illustration shows you roughly a pixel and its sub-pixels in two variants, on an ordered and on a rotated grid.

There can be several cases: no sub-pixel of a given pixel falls within a given polygon; one sub-pixel is in the polygon; two sub-pixels; three sub-pixels; or all four sub-pixels. The resulting color may accordingly take 0%, 25%, 50%, 75% or 100% of the pixel’s color, as computed, for instance, in a pixel shader.

The two methods with their different placement of sub-pixels will actually produce the same results in most common cases, but this uniformity ends as soon as we deal with lines that are nearly horizontal or vertical. The traditional ordered-grid method smoothes such lines worse: when the edge is nearly horizontal or vertical, the probability of one or three sub-pixels getting beyond the polygon edge is much smaller than the probability of two or four sub-pixels getting there. Thus, most pixels will get 0%, 50% or 100% of the color calculated in the shader. That is, under such unfavorable conditions the 4x-level antialiasing produces the same results as the presumably lower-quality 2x-level one.

Doing the same antialiasing on a rotated grid we have about the same probability of getting 0%, 25%, 50%, 75% and 100% of color on nearly-horizontal and nearly-vertical lines; so this method gives a more satisfying result. By the way, pixels on our monitors are placed on an ordered orthogonal grid, too, and the annoying “jaggies” are most visible and most annoying exactly when the edges of polygons are near horizontal or vertical. This fact makes the rotated-grid FSAA method preferable over the ordinary ordered-grid one.
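
This probability argument is easy to check numerically. The sketch below sweeps a nearly horizontal edge across a pixel and counts how often each coverage fraction occurs for an ordered and a rotated 4x pattern; the sample positions are idealized and are not the exact patterns the hardware uses:

```python
from collections import Counter

# Idealized 4x sub-pixel positions inside one pixel.
ordered = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
rotated = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def coverage_histogram(samples, slope=0.02, steps=4000):
    counts = Counter()
    for i in range(steps):
        offset = i / steps   # where the nearly horizontal edge crosses this pixel
        covered = sum(1 for sx, sy in samples if sy > offset + slope * sx)
        counts[covered / len(samples)] += 1
    return dict(sorted(counts.items()))

print("ordered:", coverage_histogram(ordered))  # 25% and 75% coverage are rare
print("rotated:", coverage_histogram(rotated))  # all five fractions occur regularly
```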

Graphics processors from ATI that use this rotated-grid FSAA used to provide a higher quality of jaggies-smoothing than NVIDIA chips, but at the same computational cost. The NV40 is going to change this situation as it boasts an improved FSAA method. Unlike graphics processors of the GeForce FX series, the NV40 places sub-pixels on a rotated grid. The following figure shows the placement of sub-pixels in the 4x FSAA method on the NVIDIA GeForce FX 5950 Ultra (left) and GeForce 6800 Ultra (right):

Using the rotated-grid method for 2x and 4x multisampling, the NV40, like the GeForce FX series processors, can mix up supersampling and multisampling. Thus, there are more variations of FSAA, besides 2x and 4x methods.

Rotated-Grid Multisampling in Practice

We launched Max Payne 2 to check out the FSAA quality of the NVIDIA GeForce 6800 Ultra, GeForce FX 5950 Ultra and ATI RADEON 9800 XT graphics cards:

The red rectangles mark the image fragments we are interested in. They contain polygon edges positioned at different angles, and one fragment (lower right) has a transparent texture.

So, let’s first have a look at nearly-horizontal lines. We will use 2x, 4x and 8x modes for graphics cards on GPUs from NVIDIA and 2x, 4x and 6x for the RADEON 9800 XT.

You see the image as produced and antialiased by the GeForce 6800 Ultra to the left; the work of the GeForce FX 5950 Ultra is presented in the middle; the RADEON 9800 XT shows its FSAA skills to the right:

[Screenshots: GeForce 6800 Ultra / GeForce FX 5950 Ultra / RADEON 9800 XT at 2x, 4x and 8x/6x FSAA]

Now, let’s examine a portion of the scene with near-vertical polygon edges. As before, the images we got with the GeForce 6800 are to the left; the GeForce FX 5950 Ultra shows its work in the middle; the RADEON 9800 XT outputs its images to the right:

[Screenshots: GeForce 6800 Ultra / GeForce FX 5950 Ultra / RADEON 9800 XT at 2x, 4x and 8x/6x FSAA]

Everything we said about the previous case (with nearly-horizontal polygon edges) can be repeated once again. The only difference is that the two leaders, the GeForce 6800 and the RADEON 9800 XT, perform much alike in the 4x mode – it’s hard to prefer either of them.

So:

Let’s test an intermediate angle of the polygon edge. As always, the results of the GeForce 6800 are to the left; those of the GeForce FX 5950 Ultra are in the middle; and the images of the RADEON 9800 XT are to the right.

[Screenshots: GeForce 6800 Ultra / GeForce FX 5950 Ultra / RADEON 9800 XT at 2x, 4x and 8x/6x FSAA]

Now, we’ve got only one case left, with a transparent texture. That’s a good indicator of supersampling and of its role in the hybrid 8x mode.

The indicator works simply: with multisampling, sub-pixels are not processed for pixels that don’t fall onto polygon edges and the “transparency borders” won’t be smoothed out. If supersampling is used in any way, the “transparency borders” will be antialiased as some sub-pixels will be processed for each pixel in the scene.

Again, the GeForce 6800 Ultra is to the left; the GeForce FX 5950 Ultra is in the middle; the RADEON 9800 XT is to the right.

[Screenshots: GeForce 6800 Ultra / GeForce FX 5950 Ultra / RADEON 9800 XT at 2x, 4x and 8x/6x FSAA]

The hybrid mode has both strong and weak points. On the one hand, supersampling in this method smoothes out transparent textures and serves like an additional method of anisotropic filtering (when sub-pixels are processed for each pixel, the projection or “footprint” of the pixel on the texture is calculated with more precision). Multisampling cannot do that. On the other hand, the additional computations make the hybrid FSAA method a resource-consuming solution, much more so than multisampling with the same number of samples.

Now, winding up the section about full-screen antialiasing, we should acknowledge the improvement of 4x FSAA in the GeForce 6800 Ultra thanks to placing sub-pixels on a rotated grid. The new chip from NVIDIA is in no way inferior to the RADEON 9800 XT in this FSAA mode.

Other FSAA modes came over to the NV40 from the NV38 without any changes, so you shouldn’t wait for anything new from the GeForce 6800 Ultra in this respect. On the other hand, the performance of the NV40 is definitely better than that of NVIDIA’s previous-generation GPUs. That’s why the high-quality, but heavy on the graphics card, “hybrid” 8x FSAA mode may become quite acceptable in situations where the performance of the NV38 was not enough.

Full-Scene Antialiasing Performance Impact

Now let us take a look at how full-scene antialiasing impacts the performance of the NVIDIA GeForce 6800 Ultra when enabled. It is worth saying that we intentionally decided to use 1600x1200 resolution, since in 1024x768 there are absolutely no performance drops when FSAA is switched on.

Both OpenGL games we used in our analysis behaved identically with FSAA enabled. Basically, 2x antialiasing does not impact performance even in 1600x1200 – thanks to the outstanding fill rate of the NV40 – while 4x antialiasing reduces speed, but not that tangibly. While hardly any of you are going to enable FSAA 8x in 1600x1200, we still checked it out, only to find that even the GeForce 6800 Ultra cannot handle this mode properly from the speed standpoint.

Performance in Direct3D games with FSAA enabled is absolutely exceptional. The 2x and 4x modes do not impact performance at all in Unreal Tournament 2004, while in Max Payne 2 the GeForce 6800 Ultra delivers 140fps with FSAA 4x, which is more than playable. FSAA 8x still brings even the world’s most powerful graphics card to its knees; still, we see that even in this mode some games will perform quite fast even in the highest resolutions.

To put it short, the new full-scene antialiasing approach NVIDIA implemented in its new NV40 graphics processor results in exceptional speed and image quality. Generally, MSAA is “free” on the GeForce 6800 Ultra for the majority of games in the majority of resolutions, except 1600x1200, where antialiasing affects speed, but very slightly. Hybrid FSAA 8x may look rather unattractive performance-wise; however, keeping in mind the astonishing power of the graphics processor, some games will run quite well even in that mode, especially in relatively low resolutions, so the highest-quality antialiasing of NVIDIA’s GPUs will still be put to use by consumers.

Anisotropic Filtering: New Method, Old Optimizations

We were surprised to find a new method of anisotropic filtering (AF) implemented into the NV40. Why change or invent something new when their existing AF method, realized in GeForce FX series chips, gives good quality and high performance?

The developers of the NV40 have another opinion. Without long descriptions, we just want to show you screenshots of a scene from 3DMark03 which is specifically designed for estimating texture-filtering quality. The screenshots were taken at the highest anisotropy level; the scene used a combination of tri-linear and anisotropic filtering, and MIP levels were highlighted. The screenshots for the GeForce 6800 Ultra are to the left; for the GeForce FX 5950 Ultra – in the middle; for the RADEON 9800 XT – to the right:

[Screenshots: GeForce 6800 Ultra / GeForce FX 5950 Ultra / RADEON 9800 XT at 2x, 4x, 8x and 16x anisotropic filtering]

The pictures produced by the GeForce 6800 Ultra and RADEON 9800 XT are surprisingly similar-looking! The NV40 seems to be using the same or similar AF method that graphics processors from ATI employ, with all its highs and lows. This method is fast and good on “convenient” angles, but there are also “inconvenient” angles of textures when the level of anisotropy is reduced and the quality of textures degenerates.

Implementing the new AF method, the developers didn’t forget about the optimizations that came along with the older method like simplification of tri-linear filtering, texture sharpness analysis, reduction of the anisotropy level and forced texture compression. The screenshots show that the RADEON 9800 XT uses true tri-linear filtering along with anisotropic one in the highest-quality mode, while the GeForce 6800 Ultra uses a simplified form of tri-linear filtering.

To see the influence of those optimizations on the quality of textures, we took the same scene and varied quality settings in the graphics card drivers.

GeForce 6800 Ultra, tri-linear filtering, modes from right to left: “High Quality”, “Quality”, “Performance”, and “High Performance”:

[Screenshots: High Quality / Quality / Performance / High Performance]

It seems like we’ll never again see true tri-linear filtering performed by NVIDIA’s graphics processors: the GeForce 6800 Ultra uses a combination of bi- and tri-linear filtering even in the “High Quality” mode.

The same modes (“High Quality”, “Quality”, “Performance”, “High Performance”) with 16x anisotropic filtering:

[Screenshots: High Quality / Quality / Performance / High Performance]

The positions of the borders of the MIP levels didn’t change – that is, the level of anisotropy remained the same. Well, there should have been no reduction of AF level since the texture in the scene is sharp – the algorithm that thinks over the possibility of reducing the AF level doesn’t allow the graphics processor to be lazy in this case.

The borders of the MIP levels become sharper as we go from “High Quality” to “High Performance” until tri-linear filtering disappears completely.

The last scene is from the OpenGL game “Serious Sam: The Second Encounter”. When taking the screenshots, we selected the maximum graphics quality settings in the game, controlling anisotropic filtering through the driver’s control panel.

So, the conditions are the same: the NVIDIA GeForce 6800 Ultra GPU performs 16x anisotropic filtering, with tri-linear filtering and MIP-levels highlighting enabled. The four screenshots refer to the four driver settings: “High Quality”, “Quality”, “Performance”, and “High Performance”.

[Screenshots: High Quality / Quality / Performance / High Performance]

There’s little difference between the images of “High Quality” and “Quality” modes. Only the remotest MIP-levels bear some traces of divergence.

When we switch to the “Performance” mode, the border lines between MIP levels become sharper. The non-colored stripes that correspond to the second texture layer become closer to the viewer. This means that the anisotropy level is reduced for the second texture layer.

The anisotropy level is reduced further for the second texture layer in the “High Performance” mode and the scene at large loses its original appearance due to forced texture compression.

Thus, the NV40 core has inherited all the set of optimizations we know from GeForce FX series chips. Is it good or bad? Well, there’s nothing wrong about it as you can always have the highest-quality picture by choosing “High Quality” in the driver settings. But if the performance of the NV40 is not enough for a given game (this scenario seems fantastic today), you can sacrifice the visual quality for more speed.

The new AF method implemented in the NV40 is practically identical to the one ATI Technologies uses in its chips, so all the known pros and cons of ATI’s anisotropic filtering now apply to the GeForce 6800 Ultra as well. By the way, that’s why we don’t compare images without highlighted MIP levels: the GeForce 6800 Ultra and the RADEON 9800 XT output textures of the same quality, while the characteristic artifacts of the NV40’s simplified tri-linear filtering are rarely perceptible and can only be seen in dynamic scenes.

Anisotropic Filtering Performance Impact

Let’s now estimate the performance of the GeForce 6800 at performing AF. We chose the highest anisotropy levels for the graphics cards:

The performance hit experienced by the NV40 after we enable anisotropic filtering is, at worst, on a level with its competitors’. In the best cases, the graphics card doesn’t slow down at all, as its performance is limited by the central processor’s speed.

Theoretical Tests

Each time we meet a new graphics processor architecture, we run a full round of synthetic tests to expose all its properties.

Synthetic Tests: Fill Rate

Fillrate Tester opens our round of synthetic tests by measuring the scene fill rate and the speed of executing pixel shaders.

Let’s measure the texturing speed of the new graphics core in the simplest case: the GPU’s standard operation mode, with both color writes and Z writes enabled.

That’s predictable: the GeForce 6800 Ultra can process 16 pixels at once and thus shows the highest results in this test. Without textures, the NV40 is close to its theoretical maximum. With one texture, the efficiency of the core drops, but it smoothly improves as more textures are added, i.e. as more clock cycles are needed to process each group of 16 pixels. The reduction of efficiency at mapping one texture (i.e. at the maximum pixel output speed) is probably due to the insufficient speed of writing pixels and Z values into the frame buffer, that is, due to insufficient memory bus bandwidth. A simple calculation shows that with 32-bit textures, a 32-bit frame buffer and Z-buffer and one texture per pixel, the GeForce 6800 Ultra has to write into the graphics memory (into caches, to be precise, and then into the memory) 128 bytes of data per clock cycle, or about 52 gigabytes per second, while the bandwidth of its memory bus is “only” about 32GB/s. When there are more textures to be mapped, the processor performs memory writes less frequently – the memory bus bandwidth stops playing the limiting role – and the NV40 performs close to its theoretical maximum.
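To make the arithmetic explicit, here is the back-of-the-envelope calculation behind those figures (assuming the GeForce 6800 Ultra’s 400MHz core clock, 16 pixels per clock, 32-bit color plus 32-bit Z per pixel, and the 256-bit memory bus at 1100MHz effective – estimates, not measured values):

PIXELS_PER_CLOCK = 16
CORE_CLOCK_HZ = 400e6                    # GeForce 6800 Ultra core clock
BYTES_PER_PIXEL = 4 + 4                  # 32-bit color + 32-bit Z
bytes_per_clock = PIXELS_PER_CLOCK * BYTES_PER_PIXEL    # 128 bytes per clock
write_demand = bytes_per_clock * CORE_CLOCK_HZ          # ~51.2e9 bytes/s
BUS_WIDTH_BYTES = 256 // 8               # 256-bit memory bus
MEM_CLOCK_HZ = 1100e6                    # effective GDDR3 data rate
bus_bandwidth = BUS_WIDTH_BYTES * MEM_CLOCK_HZ          # ~35.2e9 bytes/s
print(f"Write demand:  {write_demand / 1e9:.1f} GB/s")  # ~51 GB/s required
print(f"Bus bandwidth: {bus_bandwidth / 1e9:.1f} GB/s") # ~35 GB/s (about 32GB/s counted in binary gigabytes)

However you count gigabytes, the write demand exceeds the available bandwidth by roughly half, so the single-texture case is indeed write-limited.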

With Z writes disabled, there are no surprises: the NV40 is more than twice as fast as its competitors. Again, the drop between the no-texture and one-texture points is steeper with the NV40 than with the GeForce FX 5950 Ultra and the RADEON 9800 XT. But with two or more textures, the GeForce 6800 Ultra again shows higher efficiency than the last-generation graphics processors.

By disabling color writes we switch the NV40 into a mode where the graphics processor outputs 32 Z values per clock cycle, that is, the fill rate doubles. However, the results don’t comply with the theory: the NV40’s fill rate should be about 12,800 MPixels/second, while the test produces results around 20,000 MPixels/second. For some reason, the GeForce 6800 Ultra shows results higher than the theoretical maximum!
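For reference, the theoretical figure quoted above follows from a trivial multiplication (again assuming the 400MHz core clock):

Z_VALUES_PER_CLOCK = 32      # 16 pipelines, two Z values each, with color writes off
CORE_CLOCK_MHZ = 400
print(Z_VALUES_PER_CLOCK * CORE_CLOCK_MHZ)   # 12800 MPixels/s expected; the test reports ~20000

The measured ~20,000 MPixels/second suggests the chip (or the driver) handles the Z-only case in some way this simple model doesn’t capture.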

Synthetic Tests: Pixel Shaders

Using the same Fillrate Tester we’ll see the NV40 executing pixel shaders of a varying degree of complexity.

These are excellent results, don’t you agree? NVIDIA’s claims about a considerable improvement of the pixel shader speed seem to be true to life.

What’s curious, the GeForce 6800 Ultra enjoys a performance gain from working with 16-bit precision, and it’s even bigger than the gain the GeForce FX 5950 Ultra gets. When talking about faster pixel shader execution, NVIDIA didn’t mention half-precision calculations at all, which made us think that the NV40 wouldn’t gain anything from reducing the precision. Practice shows, however, that the pixel processor of the NV40, already accelerated in comparison to the NV38, can get an additional boost from switching to the lower calculation precision.

It’s also curious that the NV40 executed the simple version 1.1 pixel shader slower than the simple version 2.0 shader, unlike the NV38.

The situation remained the same when we disabled Z writes. So we have nothing to add to the above words.

When we disabled color writes, the NV40 only started outputting 32 Z values per clock cycle with the simple shaders, and showed a smaller result with the complex version 2.0 pixel shaders. Well, this behavior is common to all the graphics cards; it is just more conspicuous with the NV40.

Right now we are working in our test labs on a comprehensive test that would allow evaluating modern DirectX 9-compatible graphics cards. In this review we used Xbitmark to check out the performance of the graphics cards at executing DirectX 9 pixel shaders of varying complexity.

The NV40 has an easy win over the ex-champion RADEON 9800 XT, which is especially clear with the sophisticated “27-Pass Fur” shader. The new chip outperforms the RADEON by 1.5-3 times nearly everywhere! It really sets new standards for PC graphics hardware.

Next go the tests of the pixel shader speed from 3DMark suites:

The performance of the CPU and the system at large limits the speed of the graphics cards in the lowest resolutions in 3DMark2001, but in higher resolutions the NV40 shows higher results than its adversaries. In 1600x1200, the GeForce 6800 Ultra is 2.5 times faster than the GeForce FX 5950 Ultra and 1.5 times faster than the RADEON 9800 XT.

The GeForce 6800 Ultra is good at executing version 1.4 pixel shaders, too, leaving behind the ex-leader RADEON 9800 XT.

The GeForce 6800 Ultra is the best at dealing with version 2.0 pixel shaders and the gap is astonishing. This performance promises excellent results in modern shader-heavy games.

Now let’s check out vertex processors.

Synthetic Tests: Vertex Processors

There’s no need to comment: the NV40 is unrivalled and its high fill rate and fast memory subsystem allow it to have a smaller performance hit in higher resolutions than its competitors have.

The GeForce 6800 Ultra is the fastest, too, at executing version 2.0 vertex shaders.

Synthetic Tests: T&L

In fact, all benchmarks of T&L speed can now be regarded as measuring vertex shader execution speed, since all modern graphics processors translate the commands of classic T&L into appropriate vertex shaders. We’ve got curious results:

The GeForce 6800 Ultra is faster than the other cards at processing a one-light-source scene, but the gap is smaller than in the pixel shader tests.

The GeForce 6800 Ultra remains the leader in the test with eight light sources, but the difference between the results of the NV40 and the NV38 is even smaller here. That is, with six vertex processors and, potentially, twice the geometry performance of the NV38, the new NV40 core is only a little faster! This points to a less efficient translation of T&L instructions into shaders on the NV40. The problem may be fixed in a future driver version, as the NV40 shows very satisfying results in all other geometry tests.
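To illustrate why the eight-light scene is so much harder on the vertex processors than the one-light scene, here is a simplified, purely illustrative Python model of what “classic T&L translated into a vertex shader” amounts to – a modelview-projection transform plus per-vertex Lambert diffuse lighting; the function and the lighting model are our own simplification, not the driver’s actual shader code:

import numpy as np

def tnl_vertex(position, normal, mvp, lights):
    # Transform the vertex into clip space, then accumulate diffuse lighting
    # from every enabled light source - one loop iteration per light, so the
    # per-vertex cost grows roughly linearly with the number of lights.
    clip_pos = mvp @ np.append(position, 1.0)                  # 4x4 matrix * vec4
    color = np.zeros(3)
    for light_dir, light_color in lights:
        n_dot_l = max(float(np.dot(normal, light_dir)), 0.0)   # Lambert term
        color += n_dot_l * np.asarray(light_color)
    return clip_pos, np.clip(color, 0.0, 1.0)

With eight lights the loop body runs eight times per vertex, so geometry throughput is dominated by the lighting math rather than by the transform itself – and an inefficient translation of T&L commands into shader code can easily eat up the advantage of the NV40’s extra vertex processors.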

Synthetic Tests: Relief Imitation, Sprites Speed and more

Here’s another surprise for you: the NV40 is slow at processing point sprites, although its speed doesn’t depend on the resolution. The latter fact again makes one think that the ForceWare 60.72 driver doesn’t yet work well with the new GeForce.

The GeForce 6800 Ultra doesn’t cope with the seemingly simple task of rendering relief with the EMBM method! The new chip shows an acceptable result in 1600x1200 only, and even that is probably thanks to its super-fast memory subsystem; some factor clearly limits the NV40’s results here. The problem seems to be rooted in the driver, too.

The new GeForce feels better at rendering relief with the Dot3 method. Its results are definitely limited by some system bottleneck, not by the GPU performance. In higher resolutions, the GeForce 6800 Ultra is 1.5-2 times faster than its competitors.

The Ragtroll test shows the overall balance of the “CPU-driver-GPU” system and it confirms the superiority of the NV40.

Overall, the NV40 architecture and the GeForce 6800 Ultra card have a tremendous advantage over the GeForce FX 5950 Ultra and RADEON 9800 XT. However, there are some troubles, particularly with EMBM relief rendering. They are most probably due to the unrefined driver (we used the ForceWare 60.72 driver in our tests). The rest of the tests agree that the NV40 is a truly next-generation graphics processor that lifts us to new levels of performance and functionality.

We’d also like to stress the fact that the phenomenal speed of the NV40 doesn’t owe anything to application-specific optimizations. This is the raw speed of the new architecture alone. We are absolutely sure about that as Xbitmark is a test that has never left our laboratories and thus couldn’t have been optimized for.

The results shown by the GeForce 6800 Ultra in synthetic tests promise a brilliant show in modern games. Let’s get to our gaming tests, then!

Gaming Tests

We tested the sharpness of NVIDIA’s new weapon in the following testbed:

The new chip is going to be the premium offer in NVIDIA’s product range, so we took worthy rivals to it: the GeForce FX 5950 Ultra and the RADEON 9800 XT. Regrettably, our overclocking of the GeForce 6800 Ultra was hardly a success as the card only worked stably at 435/1150MHz frequencies (GPU/memory). The memory chips didn’t even reach the frequency they are rated for, 1200MHz. Well, the NV40 is a complex chip, so it’s no wonder it shows low overclockability.

The NV40 supports 16x anisotropic filtering (AF) and we enabled this anisotropy level for it as well as for the RADEON 9800 XT. The GeForce FX 5950 Ultra performed AF in the 8x mode, the maximum available anisotropy level for this card.

The following games and benchmarking suites become the battlefield where the three giants of 3D graphics meet:

There are two new names in the list, marked with bold type. The first of them, Firestarter, is a 3D shooter from a Russian developer. The game itself is no revelation, but rather a well-made mainstream product, and it suits testing purposes well thanks to its excellent scalability. There’s only one problem: the game uses a 1600x1024 resolution instead of 1600x1200, and the RADEON 9800 XT refused to work in this mode. So we only post two results for this graphics card in Firestarter, in 1024x768 and in 1280x1024.

Lock On is a regular air-combat simulator, differing from the renowned IL-2 Sturmovik in its setting: you fly present-day MiG-29 and Su-27 machines rather than World War 2 planes. The game features beautiful graphics, a realistic flight model, and multilayer volumetric clouds that can pull down the fps rate, so a powerful graphics card and a fast CPU are a must for this game, just as they are for IL-2.

As usual, we set the highest graphics quality settings in each game. During our tests, we encountered two cases of incorrect image reproduction, in Far Cry and Command & Conquer: Generals; we’ll discuss them in the corresponding sections of the review.

In the near future, we are going to replace the demo versions of Unreal Tournament 2004 and Far Cry with their full versions to provide a more accurate picture of how modern graphics cards perform in real games.

3D First-Person Shooters

RTCW: Enemy Territory

This game is no problem for top-end graphics cards like the ones included in this review – the results are limited mainly by the performance of the central processor. However, you may see that the GeForce FX 5950 Ultra and the RADEON 9800 XT slow down in higher resolutions, while the GeForce 6800 Ultra goes through all the modes at the same speed.

Anisotropic filtering (AF) and full-screen antialiasing (FSAA) are resource-consuming features, but less so for the new GPU from NVIDIA: it is half again as fast as its competitors in the lowest resolution and twice as fast in the highest. Overclocking the GeForce 6800 Ultra only brings some benefit in the highest resolution.

Call of Duty

Call of Duty resembles RTCW: Enemy Territory in gameplay, but uses a more advanced engine with pixel shader support. When no advanced graphics features are enabled in 1024x768, performance in Call of Duty is limited by the central processor. However, since the game is pretty heavy even for high-end graphics cards, the GeForce 6800 Ultra starts to show its muscle already at 1280x1024 and higher resolutions.

The workload gets higher with FSAA and AF turned on, but the GeForce 6800 Ultra doesn’t seem to feel it even in the highest resolutions. The RADEON 9800 XT, efficient at executing pixel shaders, takes the second place.

Unreal Tournament 2003

That’s the triumph of the GeForce 6800 Ultra – the card is so fast that the other two look like mainstream solutions against a top-end product. Just a little while ago, they were considered the fastest and topmost solutions themselves...

The superiority of the new NVIDIA GPU seems less overwhelming on the Antalus level, but only because its results are again limited by the CPU performance in the lowest resolution. In 1280x1024 the situation resembles what we had on the Inferno level, and the oldies feel even worse in 1600x1200.

Again, the GeForce 6800 Ultra handles the bigger workload with ease, while its competitors are having a tough time. The new GPU provides excellent playability up to 1600x1200 with a considerable fps reserve. Overclocking brings a tangible advantage.

The above words can be said again about the Antalus level.

Unreal Tournament 2004 Demo

The demo version of the new Unreal Tournament is heavier on the graphics card than the 2003 version, and the central processor only becomes a bottleneck for the last-generation cards in the lowest resolution. Both of them slow down quickly, while the GeForce 6800 Ultra notches the same result in every resolution, held back by the central processor throughout.

It’s all the same on the Bridge of Fate level, but the results are overall higher since this level is simpler – the action goes on indoors.

The GeForce 6800 Ultra keeps up a high speed with FSAA and AF turned on, providing playability in all resolutions.

Again, the GeForce 6800 Ultra shows excellent results on the Bridge of Fate level.

Halo: Combat Evolved

There’s not much to comment upon. NVIDIA’s new chip shows a very high fps rate for such a demanding game as Halo. Now you can play Halo in true comfort.

Tron 2.0

The performance of the GeForce 6800 Ultra is again limited by the CPU speed; the two other cards quickly give up in this game.

And again we see an excellent result in the FSAA+AF mode: the fps counter never goes below 50. Well, the RADEON 9800 XT is good in this relatively simple game, too.

Highly Anticipated DirectX 9 Game 1

This next-generation game used to be a crucible for all of NVIDIA’s graphics processors. Amazingly, the new GPU doesn’t have any elbowroom now as its performance again presses against the CPU speed ceiling. The RADEON 9800 XT used to be the benchmark of performance in this test, but it cannot reach the level of the NV40.

The second level that we use for tests in this game is less complex and less CPU-dependent. The GeForce 6800 Ultra spreads its wings here: it’s more than twice as fast as the ex-leader! Owners of NV40-based graphics cards will surely have a comfortable play when the final version of this game comes out.

Highly Anticipated DirectX 9 Game 2

This beta version of another next-gen game also shows what the GeForce 6800 Ultra is capable of. Evidently, the final version of the game, with various optimizations, will run on the NV40 even faster.

The GeForce 6800 Ultra is on top in the FSAA+AF mode.

The same goes for the Escape level. The diagrams show everything – there’s nothing to comment upon.

The same goes for the eye candy mode with its full-screen antialiasing and anisotropic filtering.

Far Cry Demo

If it were only about the performance, we’d again state that the GeForce 6800 Ultra is unrivalled. Alas, we met image quality problems in this game:


[Screenshots: Reference vs. NV40]

There’s no shader water, no EMBM relief, but there are some problems with texturing and lighting. A large portion of the scene would become blue from time to time, as it was with the S3 DeltaChrome. In all other games, save for Command & Conquer: Generals (see below), the GeForce 6800 Ultra outputs a picture of excellent quality, without any artifacts. The ForceWare version 60 driver is probably to blame for this.

We saw the same problems in the full version of Far Cry; hopefully, the driver updates will do away with them.

Firestarter

The GeForce 6800 Ultra is twice as fast as the last-generation GPUs in this game.

The same goes for the eye-candy mode.

3D Third-Person Shooters and Arcade Games

Tom Clancy’s Splinter Cell

It is the first time that a GPU from NVIDIA outperforms the top-end RADEON in Splinter Cell! They had to redesign the architecture and develop a whole new processor for that. :-)

Tomb Raider: Angel of Darkness

The GeForce 6800 Ultra wins this test over the once-invincible RADEON 9800 XT.

Now you can play this game with FSAA and AF enabled even in 1600x1200 – the new chip from NVIDIA permits it.

Prince of Persia: Sands of Time

The GeForce 6800 Ultra is the right graphics card to run this beautiful and resource-hungry game. The new GPU from NVIDIA is nearly twice as fast as the RADEON 9800 XT.

Max Payne 2: The Fall of Max Payne

Unlike the competitors’ results, those of the GeForce 6800 Ultra are again limited by the performance of the central processor in our testbed.

NVIDIA’s new chip feels excellent in this game even if you turn on full-screen antialiasing and anisotropic filtering.

Simulators

IL-2 Sturmovik: Forgotten Battles

Flight sims are processor-dependent games by their very nature: they require processing of a complex physical flight model, which is performed by the CPU. So we only see some difference between the cards in high resolutions, where the speed of the graphics card, rather than that of the CPU, matters more.

Strangely enough, the GeForce 6800 Ultra lost its first position to the RADEON 9800 XT starting from 1280x1024 in the eye-candy mode. What’s the reason? It seems the driver is again to blame.

Lock On

The simulator of modern air combat is insatiable. With all its graphics quality settings at the maximum, it brought down even the GeForce 6800 Ultra: 27 fps is below the playability level, although the game runs quite smoothly visually. The GPUs of the previous generation are helpless: they transform this simulator into a turn-based strategy.

Although the GeForce 6800 Ultra is faster than the others, it cannot provide playability in the high-quality mode of this game.

X2: The Threat

The fans of the X universe may rejoice as they can now play their game comfortably in resolutions above 1024x768. Moreover, we have a performance record here: 100 fps!

The performance of the GeForce 6800 Ultra is high enough for you not to watch a slide-show after turning on full-screen antialiasing and anisotropic filtering.

F1 Challenge 99 – 2002

There’s nothing new: the GeForce 6800 Ultra is impeded by the central processor, while its rivals are not strong enough to reach this barrier.

The GeForce 6800 Ultra is on top in the eye candy modes, too.

3D Real-Time Strategy Games

C&C Generals: Zero Hour

That’s the first fiasco of the new graphics processor as it is much slower than its competitors (see below).

The GeForce 6800 Ultra is faster than the others in the hard modes, but there’s no sense in talking about performance, since we have problems with the picture quality. The problems are somewhat similar to what we saw in Far Cry, but show up very oddly: the ordinary version of Zero Hour looks like a special Christmas edition:


[Screenshots: Reference vs. NV40]

As you see in the screenshot, the sandy desert has transformed into a snowy desert, with all oases buried under a thick layer of snow. Only the forlorn palms and the lack of flares on the water remind us that this is a problem of the graphics card, not a holiday edition of Command & Conquer. Hopefully, the next driver version will do away with this problem.

Semi-Synthetic Game Tests

Final Fantasy XI Official Benchmark 2

The performance of the NV40 hits against the CPU speed ceiling. Overclocking results in a nice and round score of 6800 points.

The GeForce 6800 Ultra confirms its superiority under higher workloads.

Aquamark3

Aquamark3 is another test where the GeForce 6800 Ultra feels at ease. Interestingly, overclocking brings about a hefty performance gain in 1600x1200 – about 10 fps.

The above-mentioned phenomenon shows more clearly in the FSAA+AF mode – the overclocked version of the GeForce 6800 Ultra leaves the non-overclocked one far behind. This may be due to some feature of the Aquamark3 engine, or again to the miracles of the ForceWare 60.72 driver.

Synthetic Tests

Futuremark 3DMark03 build 340

A little over a year has passed since 3DMark03 appeared and it is only now that we see a graphics card scoring more than 10,000 points in this benchmark. The overclocked GeForce 6800 Ultra pushes the bar to 12,000 points and higher!

The GeForce 6800 Ultra has firmly settled on the top position. Overclocking is rewarding in both operational modes.

The Game 2 and Game 3 tests from the 3DMark03 suite don’t use the most efficient scene-rendering algorithms, so they are hard for any modern graphics card, which seldom runs them at more than 30-40 fps. But now we have a new GPU from NVIDIA that can make these tests playable!

Yes, if these were real games, you could play them comfortably with the GeForce 6800 Ultra, although only in the pure performance mode – the non-optimized rendering algorithms of the second and third game tests take their toll anyway.

The fourth game test from the 3DMark03 suite confirms that the NV40 belongs to the next generation of graphics processors. This is especially obvious from the results with FSAA and AF.

Thus, the new chip from NVIDIA, the GeForce 6800 Ultra, showed the highest level of performance in all gaming tests, leaving its rivals far behind. The new chip is ideal for playing upcoming, next-gen games where the expensive RADEON 9800 XT and GeForce FX 5950 Ultra stumble and falter.

Of course, there’s no sense in using the GeForce 6800 Ultra in low resolutions or without enabling full-screen antialiasing and anisotropic filtering as the game speed will be most likely limited by the CPU performance. Only some super-difficult games like Lock On are an exception to this rule.

We are reserved in our praises to the GeForce 6800 Ultra, though, as it produced visual artifacts in a couple of games. These driver-related problems will probably disappear when the new GPU series from NVIDIA starts shipping in mass quantities, but they are here so far and they prevent us from calling the new GeForce an absolutely perfect solution.

Conclusion

Today we had a chance to meet NVIDIA’s newest graphics processor formerly code-named NV40. Traditionally, this chip will be the ancestor of a new family of graphics processing units from the Santa Clara, California-based company. Without any doubts, the debut of the GeForce 6800 Ultra is a massive success for NVIDIA – we are very impressed by its performance and are confident that the new chip is the fastest graphics processor for gamers we have tested so far.

The synthetic tests we ran on the NVIDIA GeForce 6800 Ultra revealed no dramatic drawbacks in the architecture of the chip. We are pretty optimistic about the balance of this architecture, its ability to scale in the future, as well as the performance the graphics processor is able to deliver in today’s and tomorrow’s applications. Benchmarking of real-world games just confirmed the theory – the NVIDIA GeForce 6800 Ultra demonstrated spectacular speed even in the most complex and high-quality modes.

Some may dispute the new approach to anisotropic filtering NVIDIA employed in its novelty; nevertheless, we think the decision was right on the whole. Being very fast and not affecting performance too severely, the new anisotropic filtering pattern delivers excellent image quality in the majority of cases. The improvements in full-scene antialiasing are a clear positive: along with the breathtaking performance improvement we got a noticeable image quality improvement too, something we always expect from a new generation of graphics processors.

Nevertheless, the NVIDIA GeForce 6800 Ultra is not only an impressive performer in 3D games, but also has such advantages as:

Bearing in mind the incredible performance the NVIDIA GeForce 6800 Ultra delivers in absolutely all types of games, including titles with a plethora of Pixel Shader 2.0 effects, as well as its support for the promising Pixel Shaders 3.0, graphics cards based on the new NV40 chip have every chance to become the best choice of Spring 2004, especially while the competitor – ATI’s R420 – is still under wraps and its performance is unclear.

NVIDIA claims that the NV40 architecture was developed with industry requirements as well as software developers’ requests in mind. If this is correct, the company has every chance to receive plenty of positive feedback about the GeForce 6800 Ultra processor and its architecture. As a result, the capabilities of the novelty are very likely to be utilized in future games and professional applications.

What is very important for us is that the performance demonstrated by the NVIDIA GeForce 6800 Ultra is achieved by the combination of hardware and software, not by application-specific optimizations. Our own benchmark shows an overwhelming speed improvement of the GeForce 6800 Ultra over the previous generations of chips. Given that Xbitmark is available only in our labs, we can safely say that the NVIDIA ForceWare drivers contain no app-specific optimizations for this test and that the power of the NVIDIA GeForce 6800 Ultra is absolutely real.

Obviously, the NVIDIA GeForce 6800 Ultra has its drawbacks, but there aren’t a lot of them:

Nevertheless, such issues are resolvable on the whole.

The massive cooling system can be substituted with something more compact by the actual makers of graphics cards. IBM has all it takes to tweak its fabrication process to increase the yield and possibly the overclockability of the NV40 chips.

Software development for the new graphics chips became a tricky task once GPUs became very complex and feature-rich. It is no surprise that certain games show image artifacts, the video processor is not yet enabled, and Pixel Shaders 3.0 do not function properly. Bugs and driver issues are inevitable nowadays, and considering that the GeForce 6800 Ultra is not yet available, it makes no sense to blame the developers for certain drawbacks in the software. Hopefully, by the time the GeForce 6800 Ultra hits the stores at its $499 price point, everything with the drivers will be resolved. On the whole, we are pretty satisfied with the ForceWare 60.72 at this point.

To sum up, the NVIDIA GeForce 6800 Ultra is the world’s fastest graphics card, with a plethora of “beyond DirectX 9.0” capabilities, able to satisfy your needs today and tomorrow. Probably one of the best things money can buy!

We are now looking forward to games that deploy all the spectacular features the new graphics processors offer...