Some time ago we learned that AMD was preparing to launch a new processor core, then called K8 (K7 being Athlon). Few people knew anything about the core at the time. Details emerged gradually, and one of them was quite a shock: K8 would be a 64-bit processor. It was a shock because the only 64-bit processor from an x86 vendor at the time was Intel Itanium, a monster built from a huge number of transistors, with a non-x86 architecture and a fantastic price. So we asked whether AMD was working on something similar. In time we got the answer: absolutely not. The K8 processor, later named Hammer, is actually similar to the existing Athlon XP, yet there are striking differences too. We also learned the name of the architecture AMD implemented in Hammer: "x86-64", that is, 64-bit x86 (by analogy with x86-32). In this article we'll try to make clear what the x86-64 Hammer actually is and how it differs from Athlon, Pentium 4 and Itanium.
But first, what do we actually need 64-bit processors for? The answer is simple: present-day applications set extremely resource-hungry tasks. In particular, 4GB of RAM is no longer an unreachable amount. Why exactly 4GB? A 32-bit processor addresses memory with 32 bits, so the maximum memory capacity in modern x86 systems is limited to 2^32 = 4,294,967,296 bytes = 4GB. It should be mentioned that Xeons support 36-bit addressing, that is, they can address up to 64GB, but tricks like that cost performance. Moreover, the maximum amount of memory a single application can use remains the same 4GB. That was one of the reasons people started thinking about building 64-bit processors.
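The arithmetic behind the 4GB and 64GB limits is easy to check. The short sketch below is only an illustration; the function names are ours, and "GB" is taken in the binary sense (2^30 bytes), as the article does.

```python
# Address-space sizes implied by the address widths mentioned above.
GB = 2 ** 30  # bytes per gigabyte in the binary sense used by the article


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def addressable_gb(address_bits: int) -> int:
    """The same limit expressed in (binary) gigabytes."""
    return addressable_bytes(address_bits) // GB


for bits in (32, 36):
    print(f"{bits}-bit addressing: {addressable_bytes(bits):,} bytes "
          f"= {addressable_gb(bits)} GB")
```

Each extra address bit doubles the limit, which is why four more bits take a 32-bit system from 4GB to 64GB.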
Basic Principles of x86-64 Architecture
x86-64 is the 64-bit architecture AMD developed for its Hammer processor family. In contrast to the 64-bit IA-64 architecture used in Intel Itanium processors, x86-64 builds on the existing x86-32 architecture. This means an x86-64 processor can run all existing 32-bit applications without any difficulty. There are a great many of them, and they represent a lot of money. Such applications run without any performance loss, unlike on Intel Itanium, where x86-32 instructions have to be emulated. So we don't have to wait until developers recompile their products for the new platform before we start using Hammer systems. The new AMD processor keeps all the advantages of its predecessors and adds a few extra capabilities that can be exploited later.
AMD chose this strategy on purpose. While Intel can just force the transition to something new (remember Rambus or various sockets supported by this company at the same time), AMD sticks to backward compatibility (again, remember the life-time of Socket A). AMD used a similar approach to the development of the 64-bit architecture.
So, how did they implement 64-bit operation in Hammer? Very simply: a few new registers were added to the register set, and the existing ones were extended:
As you see, there are eight more general-purpose registers (GPRs), R8-R15, which are available only in 64-bit mode (so using them requires recompiling programs), and the existing EAX, EBX and so on are extended from 32 to 64 bits. Eight new registers were also added to the SSE unit, which supports SSE2. The larger number of registers is meant to raise performance in resource-hungry applications, e.g., scientific calculations (which, by the way, have always been a strong point of AMD processors starting with K7).
The extended registers are shown in the picture:
As you see, the extension of EAX to RAX mirrors the extension of AX to EAX, which we saw some 15 years ago with the launch of the i386 processor. As you remember, the i386 coped excellently with 16-bit applications written for its predecessor, the 286. It is going to be the same with Hammer: the processor will easily run 32-bit code, though in that case it won't work at its full capacity.
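To make the AX / EAX / RAX nesting concrete, here is a toy Python model of one extended register. The class and method names are hypothetical. One detail is not taken from this article but from AMD's published x86-64 specification: in 64-bit mode a 32-bit write clears the upper half of the register, while a 16-bit write leaves the upper bits untouched.

```python
MASK64 = (1 << 64) - 1


class Reg64:
    """Toy model of one extended general-purpose register (e.g. RAX)."""

    def __init__(self, value: int = 0):
        self.value = value & MASK64

    @property
    def r(self) -> int:   # RAX view: all 64 bits
        return self.value

    @property
    def e(self) -> int:   # EAX view: low 32 bits
        return self.value & 0xFFFFFFFF

    @property
    def x(self) -> int:   # AX view: low 16 bits
        return self.value & 0xFFFF

    def write16(self, v: int) -> None:
        # A 16-bit write replaces only the low 16 bits.
        self.value = (self.value & ~0xFFFF & MASK64) | (v & 0xFFFF)

    def write32(self, v: int) -> None:
        # A 32-bit write zero-extends into the upper 32 bits (64-bit mode).
        self.value = v & 0xFFFFFFFF


rax = Reg64(0x1122334455667788)
print(hex(rax.r), hex(rax.e), hex(rax.x))
```

Old 32-bit code only ever touches the `e` and `x` views, which is why it runs unchanged on the wider register.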
To support both 32bit and 64bit code and registers, the x86-64 architecture allows the processor to work in two modes: Long Mode with two sub-modes (64bit and Compatibility modes) and Legacy Mode. You can read in the table what these modes are meant for:
|Mode||Operating system||Application recompilation||Address length||Operand length||Additional registers||GPR size|
|Legacy Mode||16bit or 32bit||no||32||32||no||32|
- 64bit mode features the support of:
- 64bit virtual addresses;
- 8 new 64bit general-purpose registers;
- GPRs extension to 64bit (including the "old" EAX, EBX and so on);
- 64-bit instruction pointer;
- New instruction-pointer-relative (RIP-relative) data addressing;
- Flat address space with a single space for instructions, data and the stack.
- Compatibility mode provides binary compatibility of existing 16- and 32-bit applications with a 64-bit operating system. It is selected on a per-code-segment basis. Unlike in 64-bit mode, segmentation here works as usual, with protected mode semantics. The running application sees the processor as an ordinary x86 CPU in protected mode. The operating system, however, handles address translation, interrupts and exceptions, and system data structures with the 64-bit Long Mode mechanisms;
- In addition to Long mode, x86-64 supports Legacy mode, which provides binary compatibility with 16- and 32-bit operating systems. That is, in Legacy mode the processor functions as an ordinary 32-bit x86 CPU. None of the 64-bit features are involved here. The mode provides full compatibility with all existing x86 architectures, including support for segmented memory and the 32-bit GPRs and instruction pointer.
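The mode descriptions above boil down to a small decision rule based on what kind of operating system and application are running. The sketch below is our own summary of that rule, not AMD's interface; the `Mode` class and the `operating_mode` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    gpr_bits: int          # register width visible to the program
    extra_registers: bool  # whether R8-R15 are available


def operating_mode(os_bits: int, app_bits: int) -> Mode:
    """Pick the x86-64 operating mode for an OS / application pair."""
    if os_bits == 64 and app_bits == 64:
        return Mode("Long mode, 64-bit sub-mode", 64, True)
    if os_bits == 64 and app_bits in (16, 32):
        return Mode("Long mode, Compatibility sub-mode", 32, False)
    if os_bits in (16, 32) and app_bits in (16, 32):
        return Mode("Legacy mode", 32, False)
    # A 64-bit application needs a 64-bit operating system.
    raise ValueError(f"unsupported combination: {os_bits}-bit OS, "
                     f"{app_bits}-bit application")


for os_bits, app_bits in ((64, 64), (64, 32), (32, 32)):
    print(os_bits, app_bits, operating_mode(os_bits, app_bits))
```

Only the first combination uses the wider registers and R8-R15, which is why existing software has to be recompiled to benefit from them.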
As you see, Hammer's features are used to the full only in 64-bit Long mode, that is, with 64-bit operating systems. We should note that there will be enough such operating systems by the time Hammer hits the market. Right now AMD is working with Microsoft on support for its 64-bit 8th generation processors in Windows. 64-bit Linux distributions from the main vendors will also support Hammer processors. In fact, AMD has already demonstrated Hammer support in beta versions of 64-bit Windows and SuSE Linux. So there should not be the slightest doubt that 64-bit operating systems for the x86-64 architecture will appear.
Athlon and Hammer: Processor Core Similarities and Differences
Although Hammer processors are positioned by AMD as "8th generation processors", it should be pointed out that the new architecture is the logical development of the K7 architecture. So, the work on the Hammer core didn't start from scratch. To illustrate my point, let me show you the structure of the processor core:
If you remember the K7 structure you may even be a little surprised as Hammer looks very much the same from the inside. It means that the instructions are processed in Hammer and Athlon in about the same way, if we disregard the support of new 64bit instructions and registers.
But despite the external similarity, the processor core of the 8th generation CPU has undergone certain changes. To cut it short, we would like to mention the following:
- Level 1 cache remained the same. Its size is 128KB: 64KB for data and 64KB for instructions;
- The maximum size of level 2 cache the core can address is lowered from 8MB to 1MB because of the Hammer architecture. Although Athlon processors could in theory support an 8MB L2 cache, they never really did. Hammer is intended for the server market, so where more than 1MB of cache is necessary, AMD is going to use a level 3 cache;
- The processor pipeline is two stages longer. This will allow Hammer to work at higher clock rates than Athlon does;
- Hammer will feature an improved branch prediction unit;
- 8th generation processors will also feature larger translation lookaside buffers (TLB).
Let's dwell a bit more on some of the innovations. First of all, note that the L2 cache in Hammer remains 16-way set-associative, as in Athlon, and it also remains exclusive. However, AMD claims that Hammer's cache was developed independently of the Athlon XP unit, so the similarity doesn't necessarily mean they'll be equally efficient. For example, there are hopes for a long-awaited increase in L2 cache bus width. The 64-bit bus used in the Athlon family looks rather outdated, and we tend to believe that in Hammer this bus will be at least 256 bits wide, as in Pentium 4.
By the way, Pentium 4 showed the entire world the importance of a long pipeline. A pipeline longer than that of any predecessor or competitor allowed Pentium 4 to run at core frequencies unattainable by processors of other architectures. But at the same core frequency Pentium 4 turned out much slower than its competitors, since a high clock frequency doesn't mean more instructions are processed per clock. Inevitable branch mispredictions force Pentium 4 to flush its gigantic pipeline and sit idle until it fills again. All this lowers performance.
That's why AMD decided not to follow in Intel's footsteps and lengthened the Hammer pipeline only slightly compared with the Athlon family. Here the engineers had to compromise. In order not to lose too much performance to the extra pipeline stages, AMD spent them on subdividing the fetching of instructions from the cache and their decoding into macro-ops (simple operations performed by the processor core). So Hammer and Athlon should perform about the same at equal core frequencies. In all, Hammer's integer pipeline has 12 stages, while Athlon's had 10. By the way, experts consider 12-13 stages the optimal pipeline length for processors with core clocks from one to several gigahertz.
To make up for the longer pipeline, AMD is going to equip Hammer with an improved branch prediction unit (BPU). AMD is tuning the new BPU for complex calculations (one such task, as the Hammer developers joked, was the development of a new processor). That's why the global history counter buffer is four times larger than in Athlon CPUs. Thanks to that, Hammer's BPU "remembers" more branches and predicts future ones more accurately. If the history of earlier branches is not enough for a correct prediction, an additional unit, the branch address calculator (BAC), comes into action. The BAC can quickly (in five clocks) and fairly accurately calculate the next branch address. The processor doesn't sit idle until the exact branch address is obtained, but processes both branches of the program simultaneously. It should be noted that AMD considers it more important to predict correctly than to increase the capacity for executing several code branches in parallel, as Intel did in its Itanium family.
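Why a larger history buffer helps can be seen with a generic textbook predictor: a table of 2-bit saturating counters indexed by the outcomes of the most recent branches. This is only a sketch of the general idea and makes no claim about how AMD's BPU is actually built; the function name, the branch pattern and the history lengths are our assumptions.

```python
def mispredictions(outcomes, history_bits):
    """Count mispredictions of a global-history predictor.

    The predictor keeps one 2-bit saturating counter per possible
    history value (the last `history_bits` branch outcomes).
    """
    table = [1] * (1 << history_bits)  # 0-1: predict not taken, 2-3: taken
    mask = (1 << history_bits) - 1
    history = 0
    misses = 0
    for taken in outcomes:
        counter = table[history]
        if (counter >= 2) != taken:
            misses += 1
        # Train the counter towards the real outcome.
        table[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the real outcome into the global history.
        history = ((history << 1) | int(taken)) & mask
    return misses


# A loop branch: taken seven times, then not taken, repeated 200 times.
pattern = ([True] * 7 + [False]) * 200

for bits in (4, 8):
    print(f"{bits} history bits: {mispredictions(pattern, bits)} mispredictions")
```

With four history bits the predictor cannot tell the last loop iteration from the ones before it and misses the loop exit every time; with eight bits the whole loop fits into the history and the misses stop after a short warm-up.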
So far we have discussed the innovations meant to make the processor handle code more efficiently; now it's time to turn to data. There are changes here too, mainly connected with the TLB. The TLBs (translation lookaside buffers) are a special processor cache (also two-level, by the way) that speeds up the translation of virtual addresses of data or instructions into physical ones. Programs work with virtual addresses, so every memory access has to be translated into a physical address first. A full translation takes about three clocks. The TLB keeps the results of previous translations, so the address of previously used data is translated in one clock. By the way, one of the reasons Pentium III could sometimes beat the architecturally more advanced Thunderbird (it would be incorrect to compare Pentium III with Athlon XP, as the newer AMD processor runs at a much higher core clock) is that it had a larger L1 TLB: 32 entries for instructions and 72 for data, while Athlon had only 24 and 32 respectively. Athlon XP didn't catch up with Pentium III here, as it has only 40 TLB entries for data. Hammer will feature 40 entries for instructions, with the number of data entries remaining at 40.
An even more interesting feature of Hammer's TLB implementation is its excellent performance when switching between tasks. Usually the TLB is cleared when the operating system switches to a different thread. Under extremely intensive work (e.g., on a server) the system switches between tasks very often, the processor has to clear and refill the TLB all the time, and performance suffers. But there's another approach. For example, existing RISC processors restore the TLB quickly by assigning a unique ID to every thread and keeping its TLB contents, which makes the refill after a task switch faster. Hammer is supposed to have a similar system. That's why the significant L2 TLB increase to 512 entries (twice as many as in Athlon XP) is quite natural.
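The benefit of tagging TLB entries with a task ID can be illustrated with a small simulation. This is a simplified sketch: the function name, the two-task workload and the first-in-first-out eviction are our assumptions, and only the 512-entry capacity comes from the text.

```python
def tlb_misses(rounds, pages_per_task, tagged, capacity=512):
    """Count TLB misses for two tasks that alternate on the CPU.

    tagged=False: the TLB is flushed on every task switch.
    tagged=True:  entries carry a task ID and survive the switch.
    """
    tlb = {}  # (task_id, virtual_page) -> True; dicts keep insertion order
    misses = 0
    for _ in range(rounds):
        for task_id in (0, 1):
            if not tagged:
                tlb.clear()  # conventional behaviour: flush on switch
            for vpage in range(pages_per_task):
                key = (task_id, vpage)
                if key not in tlb:
                    misses += 1
                    if len(tlb) >= capacity:
                        del tlb[next(iter(tlb))]  # evict the oldest entry
                    tlb[key] = True
    return misses


print("flush on switch:", tlb_misses(10, 16, tagged=False))
print("tagged entries: ", tlb_misses(10, 16, tagged=True))
```

With flushing, every task pays for all its translations again after each switch; with tagged entries each translation is missed only once, provided the TLB is large enough to hold the working sets of both tasks.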
Fresh Idea: Memory Controller in the CPU
One of the main innovations in Hammer is a memory controller integrated into the processor core. The same approach was used by Transmeta in its Crusoe solution, and AMD decided to take the idea a bit further. The main advantage of a built-in memory controller over an ordinary one placed in the North Bridge of the chipset is that it works at the processor core clock and therefore has lower latency. And the higher the processor frequency, the lower the latency.
One more advantage of the integrated controller is that AMD won't depend on chipset makers when it comes to working with memory. There have been cases when a poor memory controller in the chipset greatly limited overall system performance. Manufacturers even had to release new revisions to get rid of memory problems (remember the KT266 case). Moreover, the data will no longer travel over the processor bus, so there'll be one "bottleneck" less.
The Hammer memory controller will work with DDR memory of the PC1600/2100/2700 standards and will be 64- or 128-bit wide. It means that either one or two memory channels can be involved. And as AMD decided to promote Hammer in the server market, the ECC memory support looks quite natural.
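The peak bandwidth of these configurations follows directly from the module rating and the number of 64-bit channels. The sketch below uses the nominal ratings encoded in the module names (PC1600 = 1600MB/sec per channel, and so on); the function and table names are ours.

```python
# Nominal peak rate of one 64-bit DDR channel, in MB/sec, as encoded
# in the module names quoted above.
MODULE_MB_PER_S = {"PC1600": 1600, "PC2100": 2100, "PC2700": 2700}


def peak_bandwidth_gb_s(module: str, bus_bits: int) -> float:
    """Peak memory bandwidth for a 64-bit or 128-bit wide controller."""
    if bus_bits not in (64, 128):
        raise ValueError("the controller is either 64 or 128 bits wide")
    channels = bus_bits // 64
    return MODULE_MB_PER_S[module] * channels / 1000


for module in MODULE_MB_PER_S:
    for bus_bits in (64, 128):
        print(f"{module}, {bus_bits}-bit: "
              f"{peak_bandwidth_gb_s(module, bus_bits):.1f} GB/sec")
```

The two PC2700 results, 2.7GB/sec for one channel and 5.4GB/sec for two, are the memory bandwidth figures quoted later for ClawHammer and SledgeHammer.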
AMD claims that its memory controller will support "future memory standards", too. This apparently means that as soon as DDR II comes out (next year), followed by other memory standards (such as DDR III), the memory controller will be modified accordingly.
Let's try to understand how the Hammer memory controller works. Look at the chart:
So, the processor is equipped with an integrated memory controller (MCT) and a DRAM controller (DCT). The MCT is the interface between the processor core and the DCT. The MCT doesn't depend on the memory type, while the DCT is the circuit built for a specific memory type. So AMD can produce Hammers supporting any memory type, and the only thing it needs to do is replace the DCT, which is quite a small part of the processor. This is a rather flexible approach that makes it easy to support future memory standards.
Integrating the memory controller into the processor core opens wide possibilities for chipset makers to build single-chip solutions for Hammer. The AGP controller is the only thing left of the traditional North Bridge, and it can easily be implemented in the South Bridge chip. SiS is in fact already producing single-chip solutions that unite the North and South Bridges in one chip. Building such a chipset for Hammer will be much easier: the integrated memory controller saves a lot of time and trouble. This solution makes the mainboard design much simpler and, we hope, reduces the product cost, which should help the processor's popularity outside the high-end market segments.
One More Innovation: HyperTransport
HyperTransport (formerly LDT, Lightning Data Transport) is a high-speed point-to-point data transfer bus developed by AMD and first implemented by NVIDIA in its nForce chipset to connect the North and South Bridges. To say that HyperTransport is widely used in Hammer systems is an understatement. HyperTransport in Hammer systems means much more :).
This bus is used to connect the processor and the chipset, the different parts of the chipset AMD developed for Hammer, and different processors in multiprocessor systems (see below) by means of additional HyperTransport controllers built into the processor. In short, it is used everywhere. Why? What's so good about HyperTransport? It has a lot going for it: high speed, low latency, simple design (few wires). The maximum data-transfer rate provided by HyperTransport is 6400MB/sec one way. The rate can easily be changed by setting the width to 2, 4, 8, 16 or 32 bits and the frequency to 400, 600, 800, 1000, 1200 or 1600MHz, thus getting the necessary data-transfer rate (from 100 to 6400MB/sec each way). For example, to connect processors in multiprocessor Hammer systems HyperTransport will provide 3.2GB/sec each way.
|Effective frequency, MHz||2 bits (24 pins)||4 bits (34 pins)||8 bits (55 pins)||16 bits (103 pins)||32 bits (197 pins)|
The top line shows the bus width (in bits) in every direction. The number of lines necessary to build the bus of this width is given in brackets. The left column shows the effective working frequency of the bus (HyperTransport uses data transfer along both signal fronts). And the cells of the table (italic) contain the connection bandwidth (each way) for these parameters. You can read more about HyperTransport in our article.
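The bandwidth for any combination of width and frequency can be computed rather than looked up: each transfer moves one bus width of data, so the rate in MB/sec is the width in bits times the effective frequency in MHz, divided by 8. The sketch below prints the values for the widths and frequencies listed above; the function name is ours.

```python
WIDTHS_BITS = (2, 4, 8, 16, 32)
EFFECTIVE_MHZ = (400, 600, 800, 1000, 1200, 1600)


def ht_bandwidth_mb_s(width_bits: int, effective_mhz: int) -> int:
    """One-way HyperTransport bandwidth in MB/sec."""
    return width_bits * effective_mhz // 8


# Print one row per frequency, one column per bus width.
print("MHz".rjust(6) + "".join(f"{w:>7}" for w in WIDTHS_BITS))
for mhz in EFFECTIVE_MHZ:
    row = "".join(f"{ht_bandwidth_mb_s(w, mhz):>7}" for w in WIDTHS_BITS)
    print(f"{mhz:>6}" + row)
```

The corner values reproduce the range quoted in the text: 100MB/sec for a 2-bit link at 400MHz and 6400MB/sec for a 32-bit link at 1600MHz, with 3200MB/sec for the 16-bit links used between processors.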
To better understand the philosophy of HyperTransport in Hammer systems, it's convenient to look at the AMD-8000 chipset AMD developed for its eighth-generation processors. This chipset differs greatly from all its predecessors and marks a new approach to chipset architecture. AMD-8000 doesn't have the traditional North and South Bridges, but consists of so-called tunnels: controllers whose input bandwidth differs from their output bandwidth, so that they can use the difference for their own purposes. The chipset allows "chaining" any number of tunnels, thus building systems of various complexity and characteristics.
AMD-8000 was announced to have three "bricks" of the kind for building Hammer systems. They are:
- AMD-8151 graphics AGP tunnel, supporting AGP 3.0 bus;
- AMD-8131 PCI-X tunnel, supporting PCI-X bus;
- AMD-8111 input/output tunnel, supporting USB ports, IDE-devices and PCI bus.
Connection between chipset components and the processor or between CPUs in multiprocessor systems is implemented by means of up to three integrated HyperTransport bus controllers (16bit wide with 3.2GB/sec bandwidth each way).
AMD-8151 graphics AGP tunnel is an AGP bus controller, supporting AGP 4x and AGP 8x graphics cards. This chip also features two HyperTransport bus controllers: 16bit input one and 8bit output one. Thanks to that, the AGP tunnel can receive data at the speed of 3.2GB/sec and transfer it further (in AMD-8000 - to the South Bridge) at 0.8GB/sec. The remaining 2.4GB/sec are used for "controller's own purposes". Quite enough for AGP 8x with 2.1GB/sec bandwidth, isn't it?
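The tunnel arithmetic in the paragraph above can be written down as a small check. The numbers are the ones quoted in the text (in MB/sec); the function name is hypothetical.

```python
def tunnel_headroom_mb_s(upstream_mb_s: int, downstream_mb_s: int) -> int:
    """Bandwidth a tunnel can keep for its own device.

    It is what arrives on the upstream link minus what is passed on
    to the next chip in the HyperTransport chain.
    """
    return upstream_mb_s - downstream_mb_s


AGP_8X_MB_S = 2100  # AGP 8x peak bandwidth quoted above

headroom = tunnel_headroom_mb_s(upstream_mb_s=3200, downstream_mb_s=800)
print(f"AMD-8151 headroom: {headroom} MB/sec, "
      f"enough for AGP 8x: {headroom >= AGP_8X_MB_S}")
```

The same subtraction applies to any tunnel in the chain, which is what makes the bandwidth budget of a chained system easy to reason about.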
AMD-8131 PCI-X tunnel, like AMD-8151, has two HyperTransport controllers, 16 and 8 bits wide each way. The bandwidth of these buses is 3.2 and 1.6GB/sec each way respectively. But the "filling" of the chip is different, as it contains two PCI-X bridges.
AMD-8111 input/output tunnel, unlike AMD-8151 and AMD-8131, only has one 8bit 400MB/sec HyperTransport bus controller. It's supposed to be always at the end of the HyperTransport chain. AMD-8111 supports ordinary 33MHz 32bit PCI 2.2 bus, AC'97 and 10/100 Ethernet interfaces, two USB 2.0 hubs and ATA/133 IDE-controller.
Creating different combinations of tunnels and controllers, we can get systems for various purposes:
- Low-end server with one I/O tunnel
- More complex servers supporting PCI-X devices
- High-performance workstations
As we see, the modular architecture of AMD chipset allows using it in different areas. Moreover, it's easy to implement the support of specific features by adding extra tunnels to the HyperTransport chain. The HyperTransport standard is open, so the tunnels can be made by other manufacturers as well.
Good examples of HyperTransport implementation are the reference mainboards for ClawHammer processors: the uniprocessor Solo, equipped with one socket and the AMD-8151 and AMD-8111 chips, and Stretto, which has two sockets and the same tunnels.
Brothers in Arms: ClawHammer and SledgeHammer
Right now AMD is planning to produce two Hammer modifications: ClawHammer and SledgeHammer. The former is intended for desktop PCs and low-end dual-processor servers. It'll be shipped under the well-known Athlon brand, possibly with a suffix such as Pro, Ultra or 64. The latter is the server version of Hammer, targeted at two-, four- and eight-way servers. The official name of SledgeHammer is already known: Opteron.
|ClawHammer samples||SledgeHammer samples|
ClawHammer and SledgeHammer will share basically the same architecture, but as the processors are targeted at different markets, their features will differ slightly. To cut it short, we can say that desktop ClawHammer is a "lite" version of the server SledgeHammer. The differences can be seen from the following table:
|Feature||ClawHammer||SledgeHammer|
|Number of transistors||67 million (45 million according to another source)|
|Manufacturing technology||0.13micron SOI||0.13micron SOI|
|L1 cache for instructions||64KB||64KB|
|L1 cache for data||64KB||64KB|
|Supported memory||DDR200/266/333, single-channel||DDR200/266/333, dual-channel|
|Integrated HyperTransport controllers||Two 8bit or one 16bit||Three 16bit|
|Multi-processor configurations support||Up to 2||Up to 8|
As you see, the ClawHammer core is even smaller than today's 0.18-micron Athlon XP (129 sq. mm). This will allow AMD to lower manufacturing costs and set acceptable prices for the new processors. Unofficial sources say that at launch ClawHammer will cost about $400 and mainboards for it about $200. That's not so expensive for the high-performance sector, where these processors are expected to land. Compare it with the price of top Pentium 4 models: $500-600. That's not surprising, as Pentium 4 has a larger die, 131 sq. mm. Even taking into account that Intel uses 300mm wafers and AMD 200mm ones, the manufacturing cost of future AMD processors won't be higher than that of Pentium 4.
It should be mentioned that ClawHammer was cut down significantly compared with SledgeHammer in order to reduce manufacturing costs. One of the most disappointing things is the L2 cache, reduced to a quarter: the ClawHammer cache won't even reach the size found in modern Pentium 4. A high-performance memory subsystem could make up for the smaller cache, especially since the Hammer processor has an integrated DRAM controller. But the controller used in ClawHammer supports only one DDR SDRAM channel, with a maximum bandwidth of just 2.7GB/sec.
One more thing to mention is that ClawHammer and SledgeHammer are going to use different sockets. It seems this is how AMD wants to keep customers from the temptation to use cheaper CPUs in place of their more expensive brothers. For instance, Athlon XP is widely used in dual-processor systems where Athlon MP is supposed to be. That will no longer be possible, says AMD. So they are planning to promote three sockets at once at the beginning of 2003: Socket 940 for the server and workstation market, Socket 754 for desktops and low-end servers, and Socket A, which AMD is going to support throughout 2003, for value computers.
Multiprocessor Hammer Systems: Something to Wonder at
As we have mentioned before, every Hammer processor will feature two or three HyperTransport controllers. That is more buses than the connection with the chipset needs. So, what are the other buses for? To build multiprocessor systems! The key to building multiprocessor systems with Hammer CPUs is the use of the same HyperTransport bus.
This way, a dual-CPU (or four- or eight-CPU) configuration doesn't require any support from the chipset. And as HyperTransport is quite easy to lay out on the mainboard, dual-processor Hammer systems shouldn't be expensive and will have a green light to enter the desktop market.
There's an interesting question, though. Every Hammer has its own MCT with DDR SDRAM connected to it. What happens to the memory in a multi-CPU system? Every CPU in such a system will be able to access the other processors' memory besides its own. The access goes over the same HyperTransport bus. AMD claims that its bandwidth of 3.2GB/sec each way is more than enough to transfer data within a multi-CPU system. As a result, the memory turns into a single block, as in ordinary SMP systems. As every SledgeHammer can use up to 8 modules of 2GB each, the maximum memory capacity in an 8-processor system could reach 128GB (!). By the way, there won't be any problems with addressing it, as every CPU can address 1TB (1024GB) of memory (Hammer uses 40-bit physical and 48-bit virtual addressing). Let's see what a four-way Hammer system will look like:
As you see, the processor interconnection bandwidth is growing in proportion to the number of CPUs, as the number of HyperTransport lines increases. The advantages of multi-CPU system design like that are evident. First, the processors don't need to share the memory bus as in SMP-systems from Intel. Second, direct processor interconnections solve the problem of cache coherence by transferring the data directly from one CPU to another.
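The memory figures quoted for multiprocessor systems (128GB installed, 1TB addressable) can be checked with a few lines of arithmetic. The defaults below are the numbers from the text; the function names are ours, and the virtual address space size is simply what 48 bits work out to.

```python
GB = 2 ** 30
TB = 2 ** 40


def max_installed_gb(cpus: int, modules_per_cpu: int = 8,
                     module_gb: int = 2) -> int:
    """Total memory when every CPU has all of its module slots filled."""
    return cpus * modules_per_cpu * module_gb


def address_space_bytes(address_bits: int) -> int:
    """Size of an address space with the given number of address bits."""
    return 2 ** address_bits


installed = max_installed_gb(cpus=8)
physical_gb = address_space_bytes(40) // GB   # 40-bit physical addresses
virtual_tb = address_space_bytes(48) // TB    # 48-bit virtual addresses

print(f"8-way system: {installed} GB installed")
print(f"physical address space: {physical_gb} GB")
print(f"virtual address space:  {virtual_tb} TB")
```

Even the fully populated eight-way system uses only one eighth of the 40-bit physical address space.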
As for the performance and latency of the memory subsystem in multiprocessor Hammer systems, there has been a heated argument in our forum. The main point of the discussion was the scheme AMD proposed, with connections between peer processors, and its latency and shortcomings compared, for example, with the scheme used by Sun, where the processors are connected to a central hub (we are talking about an 8-processor system here).
I'll try to sum it all up and express my opinion on the subject. The main reason to doubt that AMD chose the best scheme is that the bandwidth of the inter-processor HyperTransport buses (each way) is lower than the memory bandwidth (3.2GB/sec against 5.4GB/sec in SledgeHammer and 1.6GB/sec against 2.7GB/sec in ClawHammer). So a processor cannot get full-speed access to the memory of another one. Moreover, the data may have to pass through several processors on the way from one to another (e.g., from number 1 to number 8), which raises the latency of the memory subsystem and makes it unpredictable. I think AMD knows all this better than we do and is going to do something about it. For example, there could be some scheme for placing data in memory so that the data a processor needs is stored as close to it as possible, that is, in its own memory or in the memory of nearby CPUs. Of course, that won't solve the problem completely, but it will reduce the number of critical situations like the one described above (1st processor to 8th processor). Moreover, as the HyperTransport bandwidth is lower than that of the memory bus, a processor can go on working with its own memory at a bandwidth reduced to 2.1GB/sec while transferring data to the next CPU at the same time. Of course, there can be situations when two processors simultaneously request data from a single CPU and the memory bandwidth is not enough to serve both, not to mention its own needs. But I think such situations won't happen too often. And 3.2GB/sec is high enough for Hammer systems to work effectively with memory, at least no worse than multiprocessor systems with other architectures.
The proposed eight-way architecture may be questionable, but it has clear advantages over the SMP systems offered by competitors. Firstly, the minimum total bandwidth of the HyperTransport buses connecting any two processors is 6.4GB/sec each way, as there are at least two non-overlapping paths between any two processors. Secondly, in the central, most heavily loaded part of the network the processors are connected by three paths, not two, which gives 9.6GB/sec. That's more than in Xeon MP based systems, where a common 3.2GB/sec bus is shared by four processors. And it's better than in Itanium 2 based systems, where all processors share a 6.4GB/sec bus. As for the higher latency of memory access, the enlarged L2 cache of SledgeHammer serves to make up for it.
It should be noted that AMD pays a lot of attention to the stability of its Hammer systems. Hammers will feature a built-in thermal diode and an overheating protection circuit. Moreover, they'll be able to dynamically reduce the frequency and core voltage depending on the working mode. And don't forget the long-awaited IHS (Integrated Heat Spreader), a 2.5mm nickel-plated copper plate that protects the Hammer core against mechanical damage and helps carry the heat away. By the way, the maximum heat dissipation of Hammer core processors won't exceed 70W. There'll even be models that can function with passive cooling.
Performance: First Estimates
We shouldn't expect a drastic performance jump from the 8th generation processors. Don't forget that Hammer has basically the same architecture as Athlon. Still, by preliminary estimates, Hammer will run ordinary 32-bit applications (the ones we have now) about 25% faster than an Athlon XP working at the same core clock frequency. The integrated MCT is expected to account for about 20 points of that gain and the core improvements for about 5.
Applications recompiled for x86-64 will run about 10% faster on Hammer thanks to the extra registers and changes in code structure. This will be the case without any code optimization, just due to recompilation. The SSE2 support may also add a few points.
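These estimates can be combined into one rough figure. The sketch below assumes that the gain from recompilation multiplies with the gain already obtained on 32-bit code; that composition rule is our assumption, not something the estimates state, and the function name is ours.

```python
def combined_speedup(*gains_percent: float) -> float:
    """Multiply independent gains, each given in percent."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1 + gain / 100
    return factor


mct_gain, core_gain = 20, 5            # percentage points, per the estimates
same_clock_gain = mct_gain + core_gain # about 25% on existing 32-bit code
recompile_gain = 10                    # extra gain from x86-64 recompilation

total = combined_speedup(same_clock_gain, recompile_gain)
print(f"estimated speedup over Athlon XP at the same clock: {total:.3f}x")
```

Under that assumption a recompiled application would run a little under 1.4 times as fast as on an Athlon XP at the same clock, before any gain from SSE2.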
Of course, these estimates of the performance gain are preliminary and depend on the application. We all remember Athlon XP, whose architectural innovations (not radical at all) allowed it to perform about 1.5 times faster than a Thunderbird of the same core clock in some applications. But in most applications the gain was modest, about 5-7%. The same may happen with Hammer: the size of the gain will depend on the specific application. The highest performance increase is expected from applications that make heavy use of memory and switch between threads very often. Why? The answer is simple: the changes in Hammer compared with Athlon XP were clearly aimed at tasks like that.
Some time ago the first benchmarks of a Hammer sample working at 800MHz appeared on the Web. We wrote about it in our news. Here are the conclusions:
- ClawHammer L1 cache works slower than that in Athlon XP (not much slower);
- L2 cache is a little faster (remember AMD promising to improve the cache?);
- The Quake3 test showed that ClawHammer is as fast in this application as a Pentium 4 (Willamette) running at twice the core frequency. The built-in MCT must have contributed a lot here.
Note that the testing was performed with ordinary 32-bit applications under 32-bit Windows XP, that is, without recompilation for x86-64. So ClawHammer still has reserves to boost performance even more.
What We Have Now
So we've talked about the processor. Yes, it's good, it's excellent, it may even be the best one. But when will it arrive? The first 8th generation processors are to come out by the end of the year, official sources say. But will AMD be able to stick to its own schedule? Maybe Hammer is still far from its "hardware" incarnation? Well, not really: AMD already has both ClawHammer and SledgeHammer samples, and the first one was demonstrated as far back as the Intel Developer Forum at the end of February.
The ClawHammer system was assembled in an ordinary ATX case around the reference Solo 2 mainboard built on the AMD-8000 chipset. There were still problems with the HyperTransport AGP3.0 Graphics Tunnel, so the system was demonstrated with a PCI graphics card only.
To prove that ClawHammer can easily run both 32- and 64-bit applications and operating systems, two systems with different OSs were demonstrated. One ran the ordinary 32bit Windows XP, the other a 64bit version of SuSE Linux. Windows XP ran MS Office demo scripts. Of course, that's not a resource-hungry task, but you should bear in mind that there were still 8 months left until the CPU launch. The showcased ClawHammer was of A0 revision, which indicates that it was the first processor in silicon. By the way, most processors of this revision usually cannot function at all!
Well, I seem to have veered a bit away from the topic of our discussion. The Linux system comes next. It demonstrated that Hammer working under a 64bit OS runs both 64- and 32-bit tasks with ease. The demo application was "bouncing balls": balls bouncing off the borders of a window. The 32- and 64-bit versions were shown on a single screen simultaneously.
The systems worked that way for a few days and not a single failure was reported, which is rather unusual for an A0 revision of a processor. The core clock of the ClawHammer shown at IDF was about 800MHz, which is half of what it's going to be at launch.
The next forum AMD participated in was CeBIT, the world's largest high-tech exhibition, held in Hanover. By that time AMD had already fixed the AGP part of the chipset, so the system came with an AGP graphics card.
Recently AMD demonstrated a dual-processor Opteron (SledgeHammer) based system working under a 64bit version of Windows.net.
Then, at the E3 conference, AMD showed a ClawHammer system on which everyone could play "Medal of Honor". No problems were reported.
Computex saw two impressive demonstrations by AMD. First, a four-way Opteron based system was working as a 32bit web server under 64-bit SuSE Linux. It served web pages requested by a computer based on the 8th generation AMD Athlon processor.
The second demonstration showed the whole world the possibilities of the AMD-8151 graphics tunnel. The system based on ClawHammer and the AMD-8000 chipset had no trouble working with an AGP 8x SiS Xabre graphics card.
AMD's partners (VIA, SiS, ALi and NVIDIA) also showcased their chipsets for ClawHammer at Computex. The characteristics of these chipsets are listed below:
|Feature|AMD-8000|VIA K8HTA|SiS755|ALi M1687|
|USB 2.0|6 ports|6 ports|6 ports|6 ports|
|PCI|4 devices|5 devices|?|?|
You can see from the table that SiS755 has more features than any other solution. The only thing it lacks is PCI-X, which is available in AMD-8000. But PCI-X isn't necessary for desktops. On the whole, chipsets for Hammer offer a pretty high level of functionality.
You may ask, "And where's NVIDIA?" Yes, NVIDIA demonstrated a Hammer chipset at Computex, too. But there was no information about its characteristics; the only thing known is that it's nForce based. As the nForce specs have been reviewed many times, I thought it unnecessary to repeat them. There's also the question of the built-in graphics core. As the memory is accessed through the processor in all Hammer systems, this may hurt the GPU performance. The board with NVIDIA's ClawHammer chipset didn't feature a frame buffer, so we dare suppose that this chipset had no integrated graphics core. That may be so, but as we don't have any definite facts, we didn't include NVIDIA's chipset (called CK8, by the way) in the table. We should also mention that the mainboard based on SiS755 (which features an integrated graphics core) had an external frame buffer.
[Photos: SiS755 reference board, VIA K8HTA reference board, NVIDIA CK8 reference board, ALi M1687 reference board]
So, 8th generation CPUs are going to make their first appearance as ClawHammer by the end of the year. The first CPUs of the family will work at a 1.6GHz core frequency and will have a rating of over 3000+. As we mentioned above, the processors will be sold under the Athlon brand with, possibly, some suffix added to the familiar name. But we shouldn't expect large numbers of ClawHammer CPUs to reach the market at the end of the year. Sources say that AMD will ship only a small quantity of its new 8th generation CPUs this year. Mass ClawHammer shipments are only going to start after the New Year.
Then, in the first half of 2003, mass shipments of Opteron will begin. By that time ClawHammer will have reached the 4000+ rating. In the second half of the year the transition to the 0.09micron technology will be underway and AMD will roll out the ClawHammer-S based processor. It's supposed to be a ClawHammer redesigned for the new manufacturing technology. As a result, the ClawHammer-S die size will be 64sq.mm against the predecessor's 104sq.mm. The first ClawHammer-S will probably have a 4400+ rating. The third quarter will see the 0.09micron Hammer go mobile. The first 8th generation mobile CPU will be rated at about 3000+.
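The quoted die sizes can be checked against simple geometric scaling. The sketch below assumes that the current ClawHammer is made on a 0.13micron process (that figure is our assumption here) and that die area ideally scales with the square of the feature size; real designs rarely shrink that well. The function name is made up for the example.

```python
# Compare the quoted ClawHammer-S die size with an ideal optical shrink.
# Assumptions: the old process is 0.13 micron, and area scales with the
# square of the feature size. Both are simplifications for illustration.

OLD_PROCESS_UM = 0.13   # assumed process of the current ClawHammer
NEW_PROCESS_UM = 0.09   # process quoted for ClawHammer-S
OLD_AREA_MM2 = 104.0    # quoted ClawHammer die size
QUOTED_NEW_AREA = 64.0  # quoted ClawHammer-S die size


def ideal_shrink_area(area_mm2: float, old_um: float, new_um: float) -> float:
    """Die area after a perfect linear shrink from old_um to new_um."""
    return area_mm2 * (new_um / old_um) ** 2


if __name__ == "__main__":
    ideal = ideal_shrink_area(OLD_AREA_MM2, OLD_PROCESS_UM, NEW_PROCESS_UM)
    print(f"ideal shrink:  {ideal:.1f} sq.mm")
    print(f"quoted size:   {QUOTED_NEW_AREA:.1f} sq.mm")
    print(f"quoted/old:    {QUOTED_NEW_AREA / OLD_AREA_MM2:.3f}")
```

Under these assumptions a perfect shrink would give a die of about 50sq.mm, so the quoted 64sq.mm looks like a plausible, somewhat conservative figure.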
By the way, AMD has a good response to Intel's HyperThreading technology. It's simple: two processor cores in a single package. So we get two CPUs in one, quite the opposite of HyperThreading, where two "logical" processors are created within one core. One more advantage is evident: the number of execution units (ALU, FPU, SSE and so on) doubles and so, at least in theory, does the processor speed. HyperThreading can never achieve that. There are certain shortcomings, though. For instance, a dual-processor OS license would be required, as I can't think of a mechanism to distinguish between an extra CPU core and an extra CPU. Well, now we're trying to share the bearskin before the bear is shot. It's not certain yet whether a processor like that will ever be manufactured…
Next year may see the arrival of the new (2.0) version of HyperTransport in Hammer systems. The specification is being developed now and is supposed to be ready by the coming winter. AMD is also going to keep improving the built-in MCT, adding support for new memory types when necessary.
And the last thing to mention is that AMD is already developing the next core, code-named "K9". No one can tell yet what it'll be like (except AMD employees, of course), but the former AMD CEO, Jerry Sanders, said it would be something really outstanding…
Conclusion: Intel's in Danger!
And now let's try to compare Hammer with the competing products from Intel. These are the Pentium 4 processor as a rival to ClawHammer and Xeon as a rival to Opteron. We'll leave aside the future Prescott CPU core from Intel, as there's little information about it so far. Intel is rumored to be about to add x86-64 instruction support to Prescott (the Yamhill technology) as well as other innovations, whose effect is hard to predict.
So let's get back to Pentium 4. What advantages does it have over Athlon? Look:
- A much higher core clock, resulting in better performance;
- High-speed bus (533MHz) and memory (up to RDRAM PC1066) subsystems;
- The SSE2 instruction set;
- A larger L2 cache.
So what has Hammer got to say to this? Firstly, Hammer will work at higher frequencies than Athlon XP, thus getting closer to Pentium 4 (by the time Hammer hits the market, the Pentium 4 working frequency will have reached 2.8-3GHz, according to current information). As for performance, Hammer is going to bring the laurels back to AMD in the high-end CPU sector.
Secondly, Hammer will feature not just a high-speed, but a super high-speed HyperTransport bus. Memory traffic won't load this bus, as the memory controller is built into the core. The built-in MCT will also reduce the latency of the CPU-memory subsystem to a minimum. Thirdly, Hammer is going to fully support SSE2, so Pentium 4 is deprived of that advantage, too. Moreover, Hammer will feature the very strong FPU inherited from Athlon (the Pentium 4 FPU cannot compete with it at all), and its main trump card will definitely be the 64bit instructions. You can argue whether they're of any use now or not, but software developers are sure not to miss the opportunity to increase the performance of their applications by simple recompilation. So the arrival of x86-64 software versions is a certainty.
Now let's move on to Xeon (Prestonia). It has the same advantages as the ordinary Pentium 4, because it's actually nothing but a Pentium 4. The new Xeons (Prestonia) boast HyperThreading support, though. Can Xeon defeat Hammer with this weapon? It doesn't seem so, as, according to system makers, its effect is lower than the 30% Intel talked about. It gives a 10-15% boost, but that's not enough to make Xeon the winner.
Moreover, all Intel multiprocessor systems (Xeon with its HyperThreading is among them) use the same SMP scheme, which means that the CPU bus and memory bandwidth are shared equally among the CPUs installed. This may lead to such unpleasant things as collisions, that is, simultaneous attempts by several CPUs to write data into the same memory area. After such a disappointment they have to stay idle for quite a while and then attempt to write the data once again. As you remember from the multiprocessor section, Hammer has no bottleneck like that, as every CPU uses its own memory. Besides, the tests performed by the well-known AnandTech site have shown that in some cases HyperThreading even reduces performance.
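The difference between the two multiprocessor schemes can be shown with a toy model. It assumes that a shared bus divides its bandwidth evenly among the CPUs, while each Hammer CPU gets the full bandwidth of its own memory controller; accesses to another CPU's memory over HyperTransport are ignored. The bandwidth value is a hypothetical number chosen only for illustration, not a measured one.

```python
# Toy model: memory bandwidth per CPU on a shared bus vs. per-CPU controllers.
# The bandwidth figure is hypothetical and the model ignores remote accesses.

BANDWIDTH_GBPS = 2.0  # hypothetical bandwidth of one bus / one controller


def per_cpu_shared_bus(total_gbps: float, cpus: int) -> float:
    """Bandwidth each CPU gets when all CPUs share a single bus."""
    return total_gbps / cpus


def per_cpu_local_memory(controller_gbps: float, cpus: int) -> float:
    """Bandwidth each CPU gets from its own controller (independent of cpus)."""
    return controller_gbps


if __name__ == "__main__":
    for cpus in (1, 2, 4):
        shared = per_cpu_shared_bus(BANDWIDTH_GBPS, cpus)
        local = per_cpu_local_memory(BANDWIDTH_GBPS, cpus)
        print(f"{cpus} CPU(s): shared bus {shared:.2f} GB/s per CPU, "
              f"local memory {local:.2f} GB/s per CPU")
```

In this simplified picture the per-CPU bandwidth of the shared bus falls as CPUs are added, while it stays constant when every CPU has its own memory.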
Summing up everything discussed above, we can only conclude that if Intel doesn't take decisive steps to improve the performance of its solutions, the company runs the risk of losing the lead in the high-end sector.