1999 turned out to be a not very successful year for Intel. On the one hand, the industry received Direct RDRAM without much enthusiasm, and on the other hand, VIA made a really effective move with its Apollo Pro133/133A. As for processors, things did not go smoothly there either. AMD finally managed to make full use of its hidden engineering potential and to offer a CPU that would definitely help it out of the Low-End market sector. Moreover, AMD Athlon, launched about half a year after Pentium III, proved to be quite a promising CPU in terms of clock frequency increase. Supposedly, by the end of 2000 these CPUs should reach 1.4GHz, while Intel's latest offspring, Coppermine, brought into the world this autumn, proved much less overclockable and is expected to reach 1GHz at most by the end of the year.
What was Intel to do in this situation? Perhaps speed up the launch of its next x86 core, the last IA32 core - Willamette. At first, the launch date for the first processors on this core lay between the end of 2000 and the beginning of 2001, so the first chip samples were expected at the beginning of summer. However, in January we suddenly learned that Intel already had the first chip and was planning to provide its preferred partners with samples in April.
Meanwhile, February set in. This month brought the two most remarkable events in the CPU world: the ISSCC conference and the Intel Developer Forum (IDF). One of the most eagerly awaited items at ISSCC was the presentation of a 1GHz Willamette. However, nothing like that happened. Intel showed only a 1GHz Coppermine, yielding to AMD with its 1.1GHz Athlon. But Intel didn't want to leave this unanswered and took its revenge at IDF. At the forum the company, again very unexpectedly, showed Willamette working at 1.5GHz. Looks impressive, no doubt. But will anything change on closer inspection?
Willamette is expected to be the first radical redesign of the P6 architecture. In the five years since the launch of Pentium Pro hardly anything more significant has happened. Just imagine: an asymmetric core with units working at different frequencies, a considerably improved superscalar instruction execution mechanism, a new cache that tracks the order of instructions, enhanced multimedia and floating point units, an enormous instruction set for all possible and impossible purposes, an absolutely new 100MHz bus transferring four data packets per clock, which gives a 400MHz effective frequency, a 20-stage instruction hyper-pipeline, etc… Is that enough?
And now let's dwell on each item in turn. To begin with, let's find out what helped the new Intel processor show such a high clock frequency. We will speak about the instruction pipeline. First, let's agree to keep in mind a generally accepted rule: the longer the pipeline, the easier it is to increase the clock frequency and the lower the performance per megahertz. And vice versa. Why? The point is that the more stages the pipeline is split into, the less work each stage has to do and the faster each stage is passed. But! Suppose we have a simple block of a few operations that depend on each other:
- A=B+C
- D=A+1
In other words, operation 2 will have to wait in the instruction buffer for as long as it takes operation 1 to be executed, and that time depends solely on the pipeline length. By the way, how long is the pipeline in today's CPUs? Pentium III has a 12-stage pipeline (17 FPU stages), Athlon - 10 (15 FPU stages), Alpha - 7 (10 FPU stages). According to these data, Willamette is the indisputable leader in terms of pipeline length. It means that it needs the least time per clock and allows the highest clock frequency, but also suffers the longest delays on dependent operations (operation 2 will have to wait for 20 clocks until operation 1 is completed).
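The effect of pipeline length on dependent operations can be put into numbers with a rough sketch. It assumes the simplified model used above: a dependent operation cannot start until its predecessor has passed through every pipeline stage, and nothing else fills the gap. The function names are made up for this illustration, and real CPUs forward results between stages, so actual delays are shorter.

```python
def dependent_chain_cycles(num_ops: int, pipeline_depth: int) -> int:
    """Clocks to finish a chain in which each op needs the previous result.

    Assumption (the article's simplified model): an op starts only after
    its predecessor has left the last pipeline stage.
    """
    return num_ops * pipeline_depth


def independent_cycles(num_ops: int, pipeline_depth: int) -> int:
    """Clocks to finish fully independent ops issued one per clock."""
    if num_ops == 0:
        return 0
    # The first op fills the pipeline; each further op completes one clock later.
    return pipeline_depth + (num_ops - 1)


# Integer pipeline depths quoted in the text.
PIPELINE_DEPTHS = {"Alpha": 7, "Athlon": 10, "Pentium III": 12, "Willamette": 20}

for cpu, depth in PIPELINE_DEPTHS.items():
    dependent = dependent_chain_cycles(2, depth)
    independent = independent_cycles(2, depth)
    print(f"{cpu}: A=B+C; D=A+1 takes {dependent} clocks, "
          f"two independent ops take {independent}")
```

In this model the gap between the dependent and the independent case grows in step with the number of stages, which is the price paid for the higher clock frequency.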

Frankly speaking, things are not as simple as they might seem. First, the buffer will always contain some instructions that do not require any previously obtained results (for example: A=1+2). They can be executed alongside operation 1 (today's CPUs include several execution units that can work in parallel), so that the units do not stand idle waiting for operation 1 to leave the pipeline before passing on to operation 2.
However, the longer the pipeline (and hence the time needed to execute an instruction), the lower the chance that there will be enough such independent instructions to fully utilize the execution units while operation 1 is being performed. Here the size of the buffer is really important. A brief comment: Pentium III has a buffer for about 40 micro-operations (1 x86 instruction equals about 1.5 micro-operations). Intel claims that Willamette will have a much larger buffer, which should make it easier to keep the execution units busy.
(By the way, since we have come to speak about caches: the supposed L1 cache size of Willamette is 256KB, which is 8 (!) times larger than that of Pentium III and twice as big as that of Athlon. The L2 cache size hasn't been disclosed yet; however, it is expected to be less than 1MB - about 512KB, maybe.)
Second, branch prediction. The longer the pipeline, the more important the prediction that tells which instruction will have to be executed long before it is actually executed. And of course, any error at this stage, namely choosing the wrong branch, will certainly tell on CPU performance. The longer the pipeline, the worse the effect. Intel has promised to significantly increase the probability of correct branch prediction in Willamette by combining all the available prediction schemes. According to some sources, the accuracy of this algorithm in Willamette is almost 95%.
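To see why a misprediction hurts more on a longer pipeline, here is a back-of-the-envelope sketch. It rests on assumptions that are not in the original figures: every wrong guess costs a full pipeline refill (as many clocks as there are stages), one instruction in five is a branch (`branch_fraction=0.2` is an arbitrary illustrative value), and the CPU otherwise completes one instruction per clock. The 95% accuracy is applied to both chips purely for comparison.

```python
def effective_cpi(pipeline_depth: int, accuracy: float,
                  branch_fraction: float = 0.2, base_cpi: float = 1.0) -> float:
    """Average clocks per instruction including misprediction stalls.

    Assumptions (illustrative only): a misprediction flushes the whole
    pipeline, costing `pipeline_depth` clocks, and `branch_fraction` of
    all instructions are branches.
    """
    mispredicted_share = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicted_share * pipeline_depth


# Same predictor accuracy, different pipeline lengths.
for name, depth in (("Pentium III", 12), ("Willamette", 20)):
    cpi = effective_cpi(depth, accuracy=0.95)
    print(f"{name} ({depth} stages): {cpi:.2f} clocks per instruction")
```

Under these assumptions the same 5% of wrong guesses costs the 20-stage pipeline noticeably more than the 12-stage one, which is why the predictor's accuracy matters so much for Willamette.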
One of the tools to improve the performance here is an execution trace cache. Its main task is to store instructions in the order of their execution, i.e. if the first instruction has the address 100 and control passes from it to a second instruction with the address, say, 200, then the second instruction will be stored in this cache directly after the first one, and so on. This should help to reduce the losses caused by branches and branch prediction errors.
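The ordering idea behind the trace cache can be shown with a toy sketch. It only illustrates the principle described above and says nothing about how Intel actually implements it; the program, the addresses and the `build_trace` helper are all invented for the example.

```python
# Hypothetical toy program: address -> (instruction text, address executed next).
PROGRAM = {
    100: ("jmp 200", 200),        # control passes from address 100 to 200
    104: ("add ebx, 1", 108),     # never executed on this path
    200: ("mov ecx, eax", 204),
    204: ("ret", None),           # end of the traced path
}


def build_trace(program, start):
    """Return the instructions in execution order rather than address order."""
    trace = []
    seen = set()
    address = start
    while address is not None and address not in seen:
        seen.add(address)  # guard against loops in the toy program
        text, next_address = program[address]
        trace.append((address, text))
        address = next_address
    return trace


for address, text in build_trace(PROGRAM, 100):
    print(address, text)
```

The instruction at address 200 lands right after the one at address 100, so following the jump needs no separate fetch from another place in memory.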
There is one more tool - Advanced Dynamic Execution. By this Intel means an improved version of superscalar out-of-order execution, where the CPU rearranges instructions, breaking their natural order, to achieve higher utilization of the execution units. This feature is also a consequence of the long pipeline and is meant to minimize the delays in instruction execution.
That seems to be all about the Willamette pipeline. It is a very important factor for CPU performance; however, the performance of the units responsible for different kinds of operations is also of great importance. These include operations on integers, on floating point numbers, and on packed data, where one instruction deals with several data items simultaneously (SIMD).
Our impressions here are mixed. As for integer operations, everything is in order: Willamette's integer unit runs at double the CPU frequency. It means that in the 1.5GHz chip shown at IDF it worked at 3GHz! (Again, this is the effective frequency. In reality, the clock remains at 1.5GHz; the unit simply needs half a clock instead of an entire clock to perform a calculation. In other words, its speed doubles.) Besides, Willamette has two such units, which is why ideally there are 4 integer operations per clock.
And as for the floating point unit, the picture here is hardly impressive for Intel. Two units like that (compared to three in Athlon) will provide a 1.4GHz CPU with a peak performance of only 1.4 GFLOPS in floating point operations, because only the first unit performs real calculations, such as FADD, FMUL, etc. The second unit is in charge of auxiliary work, such as FMOVE and FSTORE. Here we have to mention that if by that time Athlon also reaches 1.4GHz (and there is every indication that it will), its figure will be 2.8 GFLOPS.
So, Intel decided not to invest in x87 in its new processor, concentrating instead on a SIMD (Single Instruction - Multiple Data) unit, which works with 64bit instructions intended for floating point operations and 128bit integer instructions. Willamette has two such units: one for register operations and one for arithmetic. Since this is SIMD, one SIMD instruction made of four operations can in some cases be executed within one clock. Altogether we get: four operations per clock x 1.4GHz = 5.6 GFLOPS of peak Willamette performance when using SIMD! Compare it with the 2.8 GFLOPS of x87 on a 1.4GHz Athlon, or with the 5.6 GFLOPS of its SIMD unit working with the 3DNow! set.
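The peak figures quoted above are simply clock frequency multiplied by floating point operations per clock. The short sketch below reproduces them; the operations-per-clock values for Athlon are derived from the article's own GFLOPS numbers rather than taken from any datasheet, and the helper name is made up for this illustration.

```python
def peak_gflops(clock_ghz: float, flops_per_clock: int) -> float:
    """Theoretical peak: every clock retires `flops_per_clock` operations."""
    return clock_ghz * flops_per_clock


CLOCK_GHZ = 1.4  # the frequency used for the comparison in the text

# (CPU and instruction set, floating point operations per clock)
# Athlon values are inferred from the 2.8 and 5.6 GFLOPS figures in the text.
CONFIGS = [
    ("Willamette x87", 1),
    ("Willamette SIMD (SSE2)", 4),
    ("Athlon x87", 2),
    ("Athlon 3DNow!", 4),
]

for name, flops_per_clock in CONFIGS:
    gflops = peak_gflops(CLOCK_GHZ, flops_per_clock)
    print(f"{name}: {gflops:.1f} GFLOPS peak")
```

These are ceilings that assume every clock carries a full set of operations; real code will fall short of them on either processor.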
That's why it's not surprising that Intel will do its best to promote Willamette's new set of SIMD instructions (SSE2) as the best solution for floating point operations.
So, there are two possible scenarios.
Intel may manage to convince software developers to use SSE2, which comprises 144 new instructions:
- 76 absolutely new instructions operating on a wide range of data (including double precision floating point numbers and quadword integers - 64bit each; if XMM registers are used and the data is packed, the operands are 128bit wide). Some instructions of this set allow programs to control data caching, as well as loading into and storing from the CPU registers.
- 68 extended SIMD instructions for integer operations. In Pentium II/III they worked only with the 64bit MMX registers; in Willamette they will be able to use the 128bit XMM registers.
If Intel succeeds, Willamette will be the coolest in floating point operations at least till the end of the year.
If software developers show no such enthusiasm and go on using good old x87, Willamette won't look so brilliant in floating point operations any more: it will hardly differ from a Pentium III working at the same clock frequency.
Since the performance of today's CPUs and memory subsystems keeps growing, the fact that the GTL+ system bus has become only 33MHz faster is not that impressive, actually. Besides, a new platform has appeared - IA64. So with Willamette Intel is introducing a new system bus, which is expected not only to increase the overall bandwidth (its 100MHz clock frequency is even lower than the 133MHz of today's GTL+; however, transferring 4 data packets per clock makes the effective frequency equal to 400MHz) but also to become a link between IA32 and IA64: after Tehama (a chipset for Willamette), it will be used in i870, intended for both the IA32 Foster and the IA64 McKinley.
So, these are the main advantages of the new bus: a significantly higher bandwidth - 3.2GB/sec (400MHz, 64bit) against 1.064GB/sec (133MHz, 64bit) of today's GTL+ (3.2GB/sec is exactly the level that can be provided by the dual-channel RDRAM supported by Tehama) and, of course, a rather promising future.
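Both bandwidth figures follow from the same arithmetic: bus clock, times transfers per clock, times bus width in bytes. A minimal sketch (the function name is invented, and 1GB is taken as 1000MB, as the figures in the text imply):

```python
def bus_bandwidth_gb_s(clock_mhz: float, transfers_per_clock: int,
                       width_bits: int) -> float:
    """Peak bus bandwidth in GB/sec, counting 1GB as 1000MB."""
    bytes_per_transfer = width_bits // 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer / 1000


# New Willamette bus: 100MHz clock, 4 transfers per clock, 64bit wide.
willamette_bus = bus_bandwidth_gb_s(100, 4, 64)

# Today's GTL+: 133MHz clock, 1 transfer per clock, 64bit wide.
gtl_plus_bus = bus_bandwidth_gb_s(133, 1, 64)

print(f"Willamette bus: {willamette_bus:.3f} GB/sec")
print(f"133MHz GTL+:    {gtl_plus_bus:.3f} GB/sec")
print(f"Ratio:          {willamette_bus / gtl_plus_bus:.2f}x")
```

The result is a theoretical peak: it is reached only if all four data packets are ready on every clock.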
As for the disadvantages, here they are. 4 data packets per clock is certainly a cool thing, but only if they are ready by the time a new clock begins. Otherwise, the bus bandwidth won't be utilized to the full extent. Frankly speaking, 3.2GB/sec will be achieved only in the most ideal situation. The second disappointment is connected with today's mainboards, which are in no way suitable for Willamette. And it is not only the new system bus that is to blame here. The reason is a new form-factor - Socket-462. This means that we will get an absolutely new platform, and no converters will help here.
Well, that is it. What is the outcome? In fact, we have a CPU carefully following the principle "people buy megahertz" and optimized for this principle rather than for higher performance. According to some preliminary info, Willamette performs on the same level as a Coppermine running at the same frequency. Or as an Athlon. That is why any performance increase of the new processor will come from its clock speed increase.

Photo by www.chip.de
New AMD processors are supposed to reach the same frequency level as Intel Willamette by the end of the year. (And both of them will require new mainboards.) They are expected to perform similarly, which means that the unstable balance between Intel and AMD is very likely to continue straight into the beginning of 2001.
And then? Then Willamette is to become the last consumer x86 processor. In 2001 Intel will start optimizing it for further performance increases. And somewhere around that time we will welcome a new CPU from AMD - SledgeHammer (K8). Judging by what we know today, we get a really interesting picture: next year Intel is very likely to be catching up with the other leaders in the x86 mainstream CPU market. If this market is still worth the effort by that time...





