In the eternal competition between the two microprocessor giants, Intel and AMD, the laurels now belong entirely to the former. AMD pulled itself together and managed to answer Intel's announcement of the Pentium 4 2.8GHz by launching the new Athlon XP 2800+, which is hardly available in today's market. However, AMD will not undertake anything else in the processor market this year that could help it put up worthy opposition to Intel. Intel, on the contrary, has saved its most interesting product for the end of the year: the new Pentium 4 3.06GHz CPU. Firstly, this move allows Intel to leave the Athlon XP family far behind in terms of performance. Secondly, with the new solution Intel has brought Simultaneous Multi-Threading technology to desktop processors, where it has never been used before. This way the company has put an end to the undeclared competition between Intel and AMD before the Christmas sales season opens. As a result, most users will consider Intel the leader of this year, and AMD can only change the situation next year, when it gets the new Barton core with 512KB L2 cache and the 8th generation Hammer processors at its disposal.
I would like to stress here that getting past the 3GHz mark turned out to be a much more significant event than initially planned. The primary reason for the tremendous stir is the support of Simultaneous Multi-Threading technology, which Intel calls Hyper-Threading (hereinafter we will use Intel's term). Intel already uses Hyper-Threading technology in its Xeon processor family, and it was expected to appear in desktop solutions with the launch of processors based on the 0.09micron Prescott core. However, cut-throat competition with AMD, as well as the coming announcement of the 8th generation AMD Hammer processors, pushed Intel to change its plans. As a result, Hyper-Threading technology has appeared in Pentium 4 CPUs now, about a year earlier than initially planned.
Hyper-Threading technology is a relatively low-cost way of increasing CPU performance at the expense of a very small increase in die size, which is why we are going to dwell on the peculiarities of this technology in this article. We will also pay due attention to the performance of the new Pentium 4 3.06GHz CPU and will evaluate the "pure" performance gain provided by Hyper-Threading in each particular case.
Before we turn to the technology and its features, we would like to draw your attention to one thing. There are many ways to improve processor architecture and increase performance: pipelining, superscalar execution, out-of-order execution, larger caches, etc. However, all these general methods lead to a pretty noticeable die size increase, which in turn results in higher production costs and greater heat dissipation. Hyper-Threading technology is based on a somewhat different idea. It is not very expensive in terms of additional transistors; however, it has to be supported by the operating system and by specially written software, i.e. it requires extra effort from software developers.
Hyper-Threading Technology: Buy One CPU, Get Another One Free!
CPU performance in general is determined by two components: the processor core clock frequency and the number of instructions processed per clock. The Pentium 4 architecture was designed from the start to reach high clock rates, which is why this CPU uses an extremely long 20-stage pipeline. This lets Pentium 4 clock rates grow by leaps and bounds, although the performance of these processors remains comparable with that of AMD Athlon XP CPUs working at considerably lower core frequencies. This can be explained firstly by the fact that the Athlon XP features more execution units working in parallel, and secondly by the fact that it refills its 10-stage pipeline much faster after a branch misprediction. This way, the Athlon XP performs more instructions per clock, although it is also far from ideal. Anyway, today's story is about a different hero, the new Intel Pentium 4. Still, keep in mind that everything we are going to say is also valid for the AMD Athlon XP architecture (with the corresponding corrections, of course).
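The relation between clock frequency and instructions per clock can be put into a few lines of Python. This is only a minimal sketch: the function name and the IPC figures are our own illustrative assumptions, not measured values for either processor.

```python
def relative_performance(clock_ghz: float, ipc: float) -> float:
    """Billions of instructions per second = clock (GHz) x instructions per clock."""
    return clock_ghz * ipc

# Hypothetical numbers, chosen only to show the trade-off:
# a long pipeline reaches a high clock at a lower IPC,
# a shorter pipeline runs at a lower clock with a higher IPC.
long_pipeline = relative_performance(clock_ghz=3.0, ipc=1.0)
short_pipeline = relative_performance(clock_ghz=2.0, ipc=1.5)
print(long_pipeline, short_pipeline)
```

With these made-up figures both designs deliver the same throughput, which is the kind of parity between very different clock rates described above.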
The major problem with increasing the performance of contemporary processors is that the number of instructions performed per clock grows much more slowly than the number of execution units in the CPU. In particular, although the Pentium 4 features 3 parallel integer units, 2 floating point units and 2 memory units, all these resources are never involved simultaneously. In the majority of cases most of these resources stay idle, either waiting for data or being of no use for the particular operation. The idling in the first case can be combated to some extent, for instance by increasing the cache size. But you will never be able to load the entire CPU under the existing concept of sequential calculations. For example, if a program only adds integers, the FPUs will never be involved. As a result, we get a really sad picture: most existing x86 programs load no more than 35% of the Pentium 4 execution units at a time.
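To make the idea of execution unit utilization more tangible, here is a toy calculation. The unit counts are the ones quoted above; the per-cycle trace is entirely made up and does not come from any real profiler, so treat the result as an illustration of the arithmetic only.

```python
# Execution unit counts quoted in the article for the Pentium 4.
UNITS = {"int": 3, "fp": 2, "mem": 2}

def utilization(cycles):
    """Fraction of execution-unit slots kept busy.

    cycles: list of dicts mapping a unit type to the number of units
    a (hypothetical) program keeps busy in that clock cycle.
    """
    if not cycles:
        return 0.0
    total_slots = sum(UNITS.values()) * len(cycles)
    busy = sum(min(cycle.get(unit, 0), count)
               for cycle in cycles
               for unit, count in UNITS.items())
    return busy / total_slots

# Made-up trace of an integer-only loop; the last cycle is a stall on data.
integer_loop = [{"int": 2, "mem": 1}, {"int": 3}, {"int": 1, "mem": 1}, {}]
print(f"{utilization(integer_loop):.0%}")
```

In this trace the floating point units are never touched and one cycle is lost completely, so well under half of the available slots do useful work.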
This particular problem gave birth to Hyper-Threading technology. Its major concept was first introduced in 1993 by a respected Intel employee, Mr. Glenn Hinton, who noticed about 10 years ago that CPU resources were never utilized to the full extent. In 1996 Intel engineers started working on integrating this technology into the promising next generation CPU architectures, namely Willamette/Foster. On August 28, 2001 Hyper-Threading technology was finally introduced, and on February 6, 2002 the first Intel Xeon processors with Hyper-Threading technology support were officially announced. Today, on November 14, 2002, Hyper-Threading has arrived in the Pentium 4 family.
A lot has changed since 1993. In particular, multi-threaded operating systems have conquered the market. Their design is based on the simultaneous work of several execution threads belonging to one or several active applications, or to the OS itself. While multi-processor systems have no difficulty processing these threads simultaneously (each processor in the system gets one thread to process), in uni-processor systems the CPU has to switch constantly between multiple threads, splitting the available time between parts of different threads.
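The time slicing performed on a uni-processor system can be sketched as a simple round-robin loop. This is a simplified model under our own assumptions (a fixed time slice, abstract "units of work", no priorities); real operating system schedulers are considerably more elaborate.

```python
from collections import deque

def round_robin(work, time_slice):
    """Run threads one at a time on a single CPU.

    work: dict mapping a thread name to its units of work.
    Returns the timeline as a list of (thread, units_run) slices.
    """
    queue = deque(work.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Unfinished thread goes to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline

print(round_robin({"A": 3, "B": 5}, time_slice=2))
```

At any moment only one thread makes progress; the other one simply waits for its next slice.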
This way, if we enable the CPU to process more than one thread at a time, its resources will be loaded much more efficiently. This is actually the major idea of Hyper-Threading. With this technology, one physical CPU is recognized by the operating system and applications as two logical CPUs. As a result, the operating system and applications assume that a CPU supporting Hyper-Threading can process two threads simultaneously, which is why they load it with much more work.
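From the software side, the number of logical CPUs is simply what the operating system reports. As a small illustration, Python's standard `os.cpu_count()` returns the number of logical processors visible to the OS (or `None` if it cannot be determined); the wrapper function name below is our own.

```python
import os

def logical_cpu_count() -> int:
    """Number of logical CPUs reported by the operating system."""
    count = os.cpu_count()
    # os.cpu_count() may return None when the count is unknown.
    return count if count is not None else 1

print(logical_cpu_count())
```

On a single-processor system with Hyper-Threading enabled and supported, one would expect such a query to report two processors rather than one.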
This is how Hyper-Threading technology actually works:
The CPU itself undergoes only minor modifications and uses its idle resources to process the second thread. In other words, Hyper-Threading is a technology that raises CPU efficiency, though it pays off only in multi-tasking and multi-threaded environments.
On the left side you can see a CPU with Hyper-Threading, while on the right is a regular dual-processor system
Let's say a few words about the modifications made to processors that acquired Hyper-Threading support. Since a physical processor with Hyper-Threading technology presents itself as two logical CPUs, some of its units have been duplicated. However, only certain control units have been duplicated; the execution units remain the same and simply get loaded more heavily and more efficiently. As a result, CPUs with Hyper-Threading have duplicated registers, including general-purpose registers and control registers, a duplicated Advanced Programmable Interrupt Controller (APIC), and some duplicated internal registers, such as the Next Instruction Pointer. All other resources, including caches, execution units, branch prediction unit, bus controller, etc., are shared by the two logical CPUs. That is why implementing Hyper-Threading technology cost the developers quite little: the processor die got only 5% bigger.
New core components
Hyper-Threading in Action
Now let's figure out how the processor actually works with Hyper-Threading technology (if you need to refresh your memory of the Pentium 4 architecture before we go into details, please see our Review).
The first part of the Pentium 4 pipeline is responsible for submitting micro-operations (uops, the decoded x86 instructions) to the execution part of the pipeline. This is exactly where all the units duplicated for the two logical CPUs are located. The picture below shows the beginning of the processor pipeline in two cases: with the instruction found in the Trace Cache (a) and without it (b).
The Trace Cache contains already decoded instructions, the uops. Most instructions have already been decoded during earlier execution and now reside in the Trace Cache. This cache is not duplicated but is shared by the two logical processors. Nevertheless, each logical CPU has its own Instruction Pointer pointing to the next instruction it is to execute. Instructions are taken from the Trace Cache in turns and are lined up in the so-called uop queue, which is also individual for each logical processor.
If the instruction is not in the Trace Cache, which is the level one instruction cache in the Pentium 4 hierarchy, the CPU has to fetch the x86 instruction from the level two cache and decode it. Fetching instructions from the cache involves the Instruction Translation Lookaside Buffer (ITLB), which translates the address stored in the Instruction Pointer into a physical address. The ITLB is also individual for each logical CPU, while the L2 cache is shared between them. There is only one x86 decoder in CPUs with Hyper-Threading, because it is never heavily loaded: most decoded instructions are already stored in the Trace Cache. If both logical processors address the decoder simultaneously, it serves them in turns, but switches only after it has completed the full decoding cycle for one of the two logical processors. The decoded instructions are saved in the Trace Cache.
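The "in turns" fetching from the shared Trace Cache can be modelled with a short sketch. It is a toy model under our own assumptions: one uop is fetched per slot, and when one logical CPU has nothing to fetch the slot goes to the other one. The function and variable names are hypothetical and do not describe Intel's actual arbitration logic.

```python
from collections import deque

def fetch_in_turns(streams, slots):
    """Return the order in which uops leave a shared trace cache.

    streams: dict mapping a logical CPU id to its list of ready uops.
    slots:   number of fetch slots (clock cycles) to simulate.
    """
    pending = {cpu: deque(uops) for cpu, uops in streams.items()}
    order = list(streams)
    turn = 0
    fetch_log = []
    for _ in range(slots):
        # Start with the logical CPU whose turn it is; if it has nothing
        # to fetch, the slot goes to the other one (an assumption here).
        for offset in range(len(order)):
            cpu = order[(turn + offset) % len(order)]
            if pending[cpu]:
                fetch_log.append((cpu, pending[cpu].popleft()))
                turn = (turn + offset + 1) % len(order)
                break
    return fetch_log

log = fetch_in_turns({0: ["a0", "a1", "a2"], 1: ["b0"]}, slots=4)
print(log)
```

Each `(cpu, uop)` pair in the log would then go to the uop queue of the corresponding logical CPU.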
The execution part of the pipeline receives the decoded instructions in two queues, one for each of the two logical CPUs. And here is what happens to them next:
Of course, Hyper-Threading needs support not only at the software level, that is from the operating system and applications. Hardware support is also required, because a CPU supporting Hyper-Threading technology still differs from regular processors. To activate both logical processors, the mainboard and its BIOS must at least support two APICs and some specific algorithms for putting the logical CPUs and the physical processor into power-saving mode.
As a result, to get Hyper-Threading technology working you need not only a CPU that implements it, but also a mainboard based on a chipset that supports it. As for today's chipsets for Socket478 mainboards, we can state the following. All Intel chipsets supporting the 533MHz system bus support Hyper-Threading, with one exception: the i845G supports Hyper-Threading only beginning with the B revision. All older i845G chipsets (A revision) do not support Hyper-Threading technology. As for chipsets from other manufacturers, the situation is not so clear. VIA claims that its chipsets do support Hyper-Threading, while SiS is about to start making updated chipset revisions in the near future. It is important to understand that Hyper-Threading is a fully open technology, and chipset makers do not have to pay Intel any license fees to implement Hyper-Threading support in their products.
Besides the support implemented in the mainboard chipset, Hyper-Threading technology also has to be recognized and initialized by the mainboard BIOS. Only in this case can both logical processors be initialized successfully and recognized by the operating system. Otherwise, if either the chipset or the BIOS does not support Hyper-Threading technology, the CPU with Hyper-Threading will be recognized by the system as one regular CPU.
If the hardware support is implemented correctly, the operating system will be absolutely sure that there are two processors installed:
It is also evident that full utilization of the processor resources in systems with Hyper-Threading is possible only under a multi-tasking operating system supporting dual-processor configurations. However, to really increase system performance, the operating system has to be specifically optimized for Hyper-Threading technology. Namely, the system threads shouldn't spin in empty wait cycles, the same problem that the PAUSE instruction is meant to address.
At present there are two operating systems optimized for Hyper-Threading technology: Linux 2.4.x and Microsoft Windows XP (both Professional and Home Edition). The widespread Windows 98 and Windows ME do not support Hyper-Threading because they lack support for multi-processor configurations. As for Windows 2000, even though this system can work in multi-processor configurations and recognizes a processor with Hyper-Threading technology correctly (that is, as two processors), the performance of such a CPU under it will still be lower in most cases than that of an analogous CPU without Hyper-Threading support. The reason is that system threads in Windows 2000 often spin in empty cycles, which are a real threat to Hyper-Threading.
First of all, the instructions from the two incoming queues pass through the Allocator and Register Rename units. Here the CPU assigns the resources needed to execute the instructions. The registers and buffers are split between the logical CPUs; however, once one of the logical CPUs stops using some of its assigned resources, they automatically become available to the other logical processor.
As soon as this stage is complete, the uops get into two sorted queues, one for memory operations and one for all other operations, each of which is also split into two parts for the two logical CPUs.
Then the micro-operations sorted this way reach the Scheduling stage, where they are arranged in the order in which they will be sent to the execution units. The operations are passed to the schedulers on a first-in-first-out basis. If necessary, the schedulers can switch from the queue of one logical CPU to that of the other. At this stage the micro-operations coming from the two logical CPUs get completely mixed, so that they can be executed simultaneously. Since register renaming has already tied the registers of both logical CPUs to specific registers of the physical processor, instructions can be executed without knowing which logical CPU each one belongs to.
After the execution stage, where the processor doesn't distinguish between the logical CPUs, comes the Retirement unit. It restores the original instruction order and determines again which logical CPU each instruction belongs to. The Re-Order Buffer is divided into two halves, one for each of the two logical CPUs.
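The interplay of mixed execution and per-CPU retirement can be sketched as follows. This is a deliberately crude model: uops carry a (logical CPU, sequence number) tag that we introduce for illustration, the execution order is simply shuffled, and data dependencies between uops are ignored altogether.

```python
import random

def execute_and_retire(uop_queues, seed=0):
    """Mix uops from all logical CPUs for execution, then retire in order.

    uop_queues: dict mapping a logical CPU id to its uops in program order.
    Returns (executed, retired): the mixed execution order and the
    per-CPU retirement order.
    """
    # Tag every uop with its logical CPU and its position in program order.
    tagged = [(cpu, seq, uop)
              for cpu, uops in uop_queues.items()
              for seq, uop in enumerate(uops)]
    # Execution ignores which logical CPU a uop came from (toy assumption:
    # any order is allowed, dependencies are not modelled).
    executed = list(tagged)
    random.Random(seed).shuffle(executed)
    # Retirement: each logical CPU has its own half of the reorder buffer,
    # where the original program order is restored.
    retired = {cpu: [] for cpu in uop_queues}
    for cpu, _seq, uop in sorted(executed, key=lambda t: (t[0], t[1])):
        retired[cpu].append(uop)
    return executed, retired

queues = {0: ["a0", "a1", "a2"], 1: ["b0", "b1"]}
executed, retired = execute_and_retire(queues)
print(retired)
```

Whatever order the uops are executed in, each logical CPU sees its own instructions retire in the original program order.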
Also please note that although the L1 and L2 caches are shared between the two logical processors, and although the Data Translation Lookaside Buffer (DTLB), which translates the addresses of the data being processed into physical addresses, is formally shared as well, all the entries stored in it are marked with a logical CPU identifier. This way you can always tell which logical processor a given entry belongs to.
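A shared translation buffer with tagged entries can be pictured as a lookup table keyed by both the logical CPU identifier and the virtual page. The class below is a hypothetical illustration of that idea only; it says nothing about the real DTLB organization, size or replacement policy.

```python
class TaggedTLB:
    """Toy shared TLB whose entries carry a logical CPU identifier."""

    def __init__(self):
        # Key: (logical CPU id, virtual page) -> physical frame.
        self._entries = {}

    def insert(self, cpu_id, virtual_page, physical_frame):
        self._entries[(cpu_id, virtual_page)] = physical_frame

    def lookup(self, cpu_id, virtual_page):
        """Return the physical frame, or None on a miss."""
        return self._entries.get((cpu_id, virtual_page))

tlb = TaggedTLB()
tlb.insert(cpu_id=0, virtual_page=0x10, physical_frame=0xA0)
print(tlb.lookup(0, 0x10))  # hit for logical CPU 0
print(tlb.lookup(1, 0x10))  # same virtual page, other logical CPU: miss
```

Because the identifier is part of the key, one logical CPU never picks up a translation that belongs to the other, even though both share the same structure.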
This way, Hyper-Threading technology really does allow loading the CPU execution units much more heavily by processing two threads simultaneously. However, keep in mind that the effect of this approach is not always positive. Firstly, if the two threads are similar in terms of instruction types, there may be no performance increase at all, because one thread will eat up all the resources required by the other, while the remaining execution units of the CPU still stay idle. Secondly, the situation may turn into a complete disaster. For example, imagine that one thread keeps busy all the resources that the other thread urgently needs, while itself just waiting for data to arrive. The operating system, which believes there are two processors in the system, will not undertake anything to solve the problem, and the processor will be simply paralyzed. This is one of the reasons why Intel encourages software developers to optimize their applications for Hyper-Threading. One of the major principles of this optimization is the use of the new PAUSE instruction in wait loops, which keeps a waiting thread from stalling the physical CPU with empty wait clocks.
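The first caveat, two threads competing for the same type of execution unit, is easy to show with a one-cycle toy model. The unit counts are those quoted earlier in the article; the per-thread demand figures are invented for illustration.

```python
# Execution unit counts quoted in the article for the Pentium 4.
UNITS = {"int": 3, "fp": 2, "mem": 2}

def issued_per_cycle(demand_a, demand_b):
    """Uops that can start in one cycle when two threads share the units.

    demand_*: dict mapping a unit type to the number of uops of that
    type a thread has ready (hypothetical figures).
    """
    return sum(min(demand_a.get(unit, 0) + demand_b.get(unit, 0), count)
               for unit, count in UNITS.items())

similar = issued_per_cycle({"int": 3}, {"int": 3})
complementary = issued_per_cycle({"int": 3}, {"fp": 2})
print(similar, complementary)
```

Two integer-heavy threads issue no more uops together than one of them would alone, while an integer thread paired with a floating point thread keeps more of the core busy.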
Closer Look: Intel Pentium 4 3.06GHz
So, today, on November 14, 2002, Intel officially announced its new Pentium 4 processor, the Pentium 4 3.06GHz. This is the first processor in the family that supports Hyper-Threading technology, and it boasts the following features:
- Core clock frequency: 3066MHz; Quad Pumped Bus frequency: 533MHz; clock frequency multiplier: 23x.
- L1 cache: 8KB for data, Trace Cache holding 12K uops for instructions; L2 cache: 512KB.
- Northwood processor core manufactured with 0.13micron copper interconnect technology.
- Nominal Vcore: 1.525V.
- 131sq.mm die size, 55 million transistors.
- Socket478 physical interface.
- Supports MMX, SSE, SSE2.
- Hyper-Threading technology support.
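The frequency figures in the list above follow from simple arithmetic: a 533MHz Quad Pumped Bus corresponds to a base clock of about 133.33MHz, and the core clock is this base clock times the multiplier. A short check, with names of our own choosing:

```python
# Base bus clock behind the 533MHz Quad Pumped Bus: 533.33 / 4 = 133.33 MHz.
BASE_CLOCK_MHZ = 400 / 3

def core_clock_mhz(multiplier: int) -> float:
    """Core clock = base bus clock x multiplier."""
    return BASE_CLOCK_MHZ * multiplier

def quad_pumped_mhz() -> float:
    """Effective bus frequency: four transfers per base clock."""
    return BASE_CLOCK_MHZ * 4

print(int(core_clock_mhz(23)), int(quad_pumped_mhz()))
```

The 23x multiplier gives the 3066MHz quoted for the new processor, which is marketed as 3.06GHz.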
As you can notice, the data listed above indicate that the Pentium 4 3.06GHz core is of the same size and consists of the same number of transistors as the previous Pentium 4 2.8GHz. Strange, isn't it? Especially since we have already mentioned that implementing Hyper-Threading technology required a die about 5% bigger. However, this is very easy to explain. It turns out that Hyper-Threading technology was built into Intel Pentium 4 processors a long time ago, and now it has simply been activated. Strange as it might seem, all Northwood based processors feature everything necessary for Hyper-Threading. Moreover, the duplicated units necessary for Hyper-Threading were already present even in the good old Pentium 4 processors on the 0.18micron Willamette core, starting from the very first models of the family. However, until lately Intel has been disabling Hyper-Threading support in its CPUs in hardware (at the die assembly stage). Therefore, if you are a happy owner of an older Pentium 4 CPU, you will never manage to enable Hyper-Threading, even though your processor has those additional 5% of transistors.
After all that it seems quite logical that the Pentium 4 3.06GHz CPU features the same C1 core stepping as its predecessor, the Pentium 4 2.8GHz, and is manufactured on 300mm wafers.
Here we have to point out that even though the semiconductor dies used in the Pentium 4 3.06GHz and in CPUs with lower clock frequencies hardly differ from one another, Intel is not going to add Hyper-Threading to the slower processors. This way, Hyper-Threading technology will remain the privilege of Pentium 4 CPUs with core clock frequencies over 3GHz.
The second desktop CPU model with Hyper-Threading technology support is due in Q2 2003. It will be the Pentium 4 3.2GHz, also based on the 0.13micron Northwood core. After that you will also see Hyper-Threading in all Pentium 4 processors based on the new 90nm Prescott core, which is due in H2 2003.
Now I would like to say a few words about the weakest point of the new Pentium 4 3.06GHz processor: high heat dissipation. Unfortunately, the introduction of Hyper-Threading technology automatically led to a pretty significant increase in the amount of dissipated heat. This is quite natural: the CPU execution units are now used more actively, so a CPU with Hyper-Threading warms up more than a similar CPU without this technology. As a result, Intel had to change the thermal and electrical requirements for systems intended to work with Pentium 4 processors supporting Hyper-Threading technology.
The initial version of Intel's requirements for mainboard makers implied that the CPU would dissipate 77W of heat at most. Now Intel has revised its requirements and released a new version known as FMB2. According to this document, Pentium 4 processors can now dissipate up to 82W of heat. As a result, mainboard makers should revise and, if necessary, modify their product designs accordingly. Moreover, the maximum current that Pentium 4 CPUs can draw has also been increased: it is now 70A, while the initial requirements allowed 60A at most. So, manufacturers of up-to-date mainboards intended for the new Pentium 4 processors with clock rates over 3GHz should make sure that their products meet the updated power and thermal requirements.
Besides, the Pentium 4 3.06GHz also needs better cooling. In particular, Intel now recommends using new, more efficient coolers with copper parts. Intel will also modify the design of the cooler shipped with the boxed processors. The new cooler model will have a copper base, more fins and a more powerful 5-blade fan with adjustable rotation speed:
However, this is far from all. Intel has also changed the thermal requirements for cases intended for systems with the new Pentium 4 at frequencies over 3GHz. One of the major changes is that from now on the temperature inside the case shouldn't exceed 42°C, whereas the previous limit was 45°C. Moreover, Intel will definitely approve of cooling solutions that take the air for the processor cooler directly from outside the case.
Testbed and Methods
The major goal of this test session was to find out the performance of the new Pentium 4 3.06GHz with Hyper-Threading technology. We will compare the performance of this processor with that of the same CPU with Hyper-Threading disabled (you can enable or disable Hyper-Threading technology in the mainboard BIOS) and with that of its predecessor, the Pentium 4 2.8GHz. Keeping in mind that contemporary Pentium 4 systems can be built with two completely different memory types, RDRAM and DDR SDRAM, we ran all the tests on two platforms using different memory types and based on the i850E and i845PE chipsets. These chipsets support Hyper-Threading technology and allow using today's most powerful memory types: PC1066 RDRAM and DDR333 SDRAM respectively.
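The "pure" gain from Hyper-Threading is simply the relative difference between the results obtained on the same CPU with the technology enabled and disabled. A minimal sketch, assuming a benchmark where a higher score is better; the function name and the sample scores are hypothetical and are not results from this review.

```python
def ht_gain_percent(score_ht_on: float, score_ht_off: float) -> float:
    """Relative gain from Hyper-Threading, in percent.

    Assumes a higher-is-better benchmark score measured on the same
    CPU with Hyper-Threading enabled and disabled.
    """
    if score_ht_off <= 0:
        raise ValueError("baseline score must be positive")
    return (score_ht_on / score_ht_off - 1.0) * 100.0

# Made-up scores, only to show the calculation.
print(f"{ht_gain_percent(110.0, 100.0):+.1f}%")
print(f"{ht_gain_percent(95.0, 100.0):+.1f}%")
```

A negative value means that enabling Hyper-Threading slowed the test down. For benchmarks that report time (lower is better), the ratio would have to be inverted.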
We compared the performance of the Pentium 4 systems with that of competing systems using today's fastest processors from AMD, namely the Athlon XP 2700+ and 2800+. The Athlon XP based systems were built on today's fastest Socket A solution, the NVIDIA nForce2 chipset with a dual-channel DDR333 SDRAM interface.
So, as a result, our testbeds were configured as follows:
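| |Intel Pentium 4|Intel Pentium 4|AMD Athlon XP|
|---|---|---|---|
|CPU|Intel Pentium 4 3.06GHz with Hyper-Threading technology; Intel Pentium 4 3.06GHz, Hyper-Threading technology disabled; Intel Pentium 4 2.8GHz|Intel Pentium 4 3.06GHz with Hyper-Threading technology; Intel Pentium 4 3.06GHz, Hyper-Threading technology disabled; Intel Pentium 4 2.8GHz|AMD Athlon XP 2800+; AMD Athlon XP 2700+|
|Mainboard|ASUS P4T533-C|ASUS P4PE|ASUS A7N8X|
|Memory|512MB PC1066 RDRAM by Samsung|512MB DDR333 CL2 SDRAM by Crucial|512MB DDR333 CL2 SDRAM by Crucial|
|Graphics Card|ATI RADEON 9700 Pro|ATI RADEON 9700 Pro|ATI RADEON 9700 Pro|
|HDD|Seagate Barracuda ATA IV, 80GB|Seagate Barracuda ATA IV, 80GB|Seagate Barracuda ATA IV, 80GB|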
All tests were run under the MS Windows XP Professional operating system, and the BIOS Setup of each mainboard was configured for the maximum possible performance.