Intel Prescott: One More Willamette-like Slow Processor or a Worthy Piece?

Intel has finally announced the new 90nm core for Pentium 4 processors. In this article we discuss in detail the peculiarities of the new Prescott: the new production technology, the improvements made to Intel’s NetBurst architecture, the heat dissipation of the new core and Intel’s plans in the processor market for the near future.

by Ilya Gavrichenkov
02/01/2004 | 12:01 PM

After a period of relative calm, the processor market has woken up for another “hot” season. Last fall AMD made a significant jump forward and introduced its new Athlon 64 processor architecture. The launch of Athlon 64 and Athlon 64 FX set off a new round of cut-throat competition in the desktop CPU market. Unlike their predecessors, the Athlon XP CPUs, the newcomers performed much faster than the top Intel Pentium 4 models available at that time. Even the release of a Pentium 4 Extreme Edition equipped with L3 cache didn’t save the situation for Intel: the sky-high pricing and the absence of actual processors in the market spoilt the whole thing. The tension grew. Some analysts started drawing parallels with 1999, when the launch of the first AMD Athlon processors evidently caught Intel unawares. However, the situation today is still very different from what we had back in 1999. Intel holds a strong trump card: the new processor core known as Prescott. Although the launch schedule for this core slipped slightly, the delay was not long enough for the Athlon 64 architecture to seriously strengthen its position in the market.

 

Today Intel officially announces the first processors based on the new Prescott core. We are thus witnessing another round of the processor arms race: the beginning of a deadly competition between two new micro-architectures, Athlon 64 from AMD and Prescott from Intel. However, we shouldn’t forget that AMD also has a few trump cards up its sleeve. In particular, Athlon 64 processors support x86-64 technology, which contemporary software does not yet use. As for 64bit extensions in Prescott, it doesn’t have any (or Intel doesn’t think it necessary to disclose them to the public). However, whether today’s desktop processors should support 64bit or not is mostly a philosophical question, and it will remain one at least until 64bit versions of Windows XP appear. Today, AMD and Intel will again compete in the 32bit field.

In today’s article we are going to discuss in detail the major architectural peculiarities of the new Prescott processor core and its differences from its predecessors, since there is quite a lot to talk about here. But before we move on to the hero of today’s story, I would like to make it absolutely clear what Intel actually announced today.

So, today, on February 2, 2004, Intel Corporation announced and started selling the new processors formerly known as Prescott. These processors, based on a new 90nm core, will continue the Pentium 4 family for at least another year, gradually ousting the previous 130nm Northwood core. The Pentium 4 processors on the Prescott core announced today are clocked from 2.8GHz to 3.4GHz, are designed for Socket 478 mainboards, and support the 800MHz bus and Hyper-Threading technology. Now let’s get down to details.

90nm Technology

Before we go deep into the architectural details of the new Pentium 4 processors also known as Prescott, I would like to dwell on the new manufacturing technology used for these CPUs. Prescott is the first x86 processor manufactured in mass quantities with 90nm technology. By comparison, Intel’s main competitor, AMD, is planning to shift to 90nm production technology only in H2 2004.

Intel has already introduced the new 90nm manufacturing technology in three fabs, so there should be no problems with mass production of the new dies. I also have to point out that together with the transition to the next-generation process technology, Intel made a number of enhancements that should speed up the transistors. This should later make it possible to raise the clock frequencies of CPUs built with the new production process even further. Among these enhancements I would like to mention the smaller gate length and the use of strained silicon.

Transistors used in Intel processors manufactured with 130nm technology feature a 60nm gate length. With the shift to the 90nm production process the transistor gate length is reduced to 50nm. This solves two problems at once. First, the transistor switches much faster. Second, the transistor gets physically smaller, which allows creating more compact and at the same time more complex semiconductor devices. However, there is a downside to the smaller transistor size: an increase in leakage currents, which becomes a serious challenge with 90nm technology. For instance, Intel applies a layer of nickel silicide right above the gate electrode to minimize the leakage currents (they used to apply cobalt silicide for the same purpose).

But the most interesting part of Intel’s 90nm production technology is strained silicon. It was developed so that open transistors can pass a higher electric current, and hence react faster and dissipate less heat. According to Intel’s data, this technology increases the current going through the open transistor channel by 10-20%. The idea behind it is very simple: the silicon lattice used in the transistor channel is “stretched” so that the distance between atoms gets bigger. This is achieved by placing the silicon on a special layer with a wider lattice. Silicon atoms try to align with the lattice underneath and move apart from one another. As a result, the electrons flow through the lattice faster and with less resistance. Despite its seeming complexity, strained silicon is not expensive to implement: the production cost per transistor is only 2% higher than that of a regular transistor.

Besides the innovations mentioned above, the transition to the 90nm production process also brought a number of less radical changes. In particular, Intel started using a new Low-k CDO dielectric to isolate copper interconnects. Moreover, 90nm semiconductor dies now have more metal (copper) layers: Prescott has 7 of them, while Northwood processors had only 6. This gives more flexibility when designing complex semiconductor devices and allows fitting more transistors into a smaller die. In fact, there is nothing revolutionary about it: AMD Athlon 64 processors, for instance, feature 9 copper metal layers.

I would like to draw your attention to the fact that Intel managed to perform the transition to the new production technology at the lowest possible expense. New lithographic equipment with a 193nm wavelength is used only for the critical spots of 90nm dies. In all other cases Intel does very well with the old 248nm lithography. This is exactly why Intel replaced only 25% of the equipment in the fabs producing 90nm dies. However, I should stress that the use of phase masks (required by 248nm lithography) can hardly be considered a successful solution today. Nevertheless, Intel is going to re-equip its facilities completely only when it shifts to the 65nm production process.

Prescott Core

The Prescott core is completely different from the previous processor dies used in Pentium 4 CPUs. It would be absolutely incorrect to claim that the new Prescott is nothing but the same Northwood core with a larger cache-memory, manufactured with a finer process technology. The differences between the newcomer and its predecessor are much more fundamental. In fact, the evidence is right here: look at the photo of the Prescott processor core:

When Intel worked on Prescott, many of the internal functional units were developed anew. Besides, the engineers relied heavily on automated design. As a result, the Prescott core looks very different from all other dies. Unfortunately, we can’t clearly single out separate functional units in the picture above. Many processor parts are simply “spread over” the entire die. The thing is that at the development stage the Prescott core layout was optimized to ensure high clock frequency potential and even heat dissipation across the whole die. As a result, overheating of separate functional units should not be as typical of the Prescott core as it is of other processor dies. We can therefore state that Intel gave up the old processor development approach, in which all the functional units were developed separately and then put together on the same die.

As for the basic features of the Prescott core, they are given in the table below, compared with the predecessors from Intel and competing solutions from AMD:

 

| | Intel Pentium 4 | Intel Pentium 4 | Intel Pentium 4 Extreme Edition | AMD Athlon 64 | AMD Athlon 64 FX | AMD Athlon XP |
|---|---|---|---|---|---|---|
| Processor core | Prescott | Northwood | Gallatin | ClawHammer | SledgeHammer | Barton |
| Socket | Socket 478 | Socket 478 | Socket 478 | Socket 754 | Socket 940 | Socket A |
| Frequencies | 2.8-3.4GHz | 1.6-3.4GHz | 3.2-3.4GHz | 2.0-2.2GHz | 2.2GHz | Below 2.2GHz |
| Production technology | 0.09 micron, “strained” silicon | 0.13 micron | 0.13 micron | 0.13 micron, SOI | 0.13 micron, SOI | 0.13 micron |
| Number of transistors | 125 mln. | 55 mln. | 178 mln. | 105.9 mln. | 105.9 mln. | 54.3 mln. |
| Die size | 112 sq.mm | 131 sq.mm | 237 sq.mm | 193 sq.mm | 193 sq.mm | 101 sq.mm |
| L1 data cache | 16KB | 8KB | 8KB | 64KB | 64KB | 64KB |
| L1 instructions cache | 12000 uops | 12000 uops | 12000 uops | 64KB | 64KB | 64KB |
| L2 cache | 1024KB | 512KB | 512KB | 1024/512KB | 1024KB | 512KB |
| L3 cache | - | - | 2MB | - | - | - |
| SIMD instructions | SSE3/ SSE2/ SSE | SSE2/ SSE | SSE2/ SSE | SSE2/ SSE/ 3DNow! | SSE2/ SSE/ 3DNow! | SSE/ 3DNow! |
| x86-64 support | - | - | - | + | + | - |
| Integrated memory controller | - | - | - | Single-channel DDR SDRAM | Dual-channel DDR SDRAM | - |

As we see, Prescott has more than twice as many transistors as Northwood. However, the doubled L2 cache alone does not explain this, because the L2 cache of the Prescott processor occupies only 25% of the die. The larger L1 data cache and the support for 13 SSE3 instructions are also very unlikely to require that many transistors. It looks as if there were other important reasons for these extra transistors to appear in the new Prescott die. Let’s try to find out what they are.
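The comparison can be checked with a few lines of arithmetic. The sketch below uses only the transistor counts and die sizes from the table above; the variable and function names are our own, chosen for illustration.

```python
# Transistor counts (millions) and die areas (sq. mm) copied from the table above.
cores = {
    "Prescott": {"transistors_mln": 125.0, "die_mm2": 112.0},
    "Northwood": {"transistors_mln": 55.0, "die_mm2": 131.0},
}


def density(core):
    """Millions of transistors per square millimetre of die."""
    return core["transistors_mln"] / core["die_mm2"]


# How many times more transistors Prescott has than Northwood.
ratio = cores["Prescott"]["transistors_mln"] / cores["Northwood"]["transistors_mln"]
print(f"Prescott / Northwood transistor count: {ratio:.2f}x")

for name, core in cores.items():
    print(f"{name}: {density(core):.2f} mln. transistors per sq.mm")
```

Prescott packs about 2.3 times the transistors into a smaller die, so its transistor density is more than 2.5 times that of Northwood.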

Prescott's Pipeline

Like the previous Willamette and Northwood cores, the new Prescott core is based on the NetBurst micro-architecture introduced in the first Intel Pentium 4 processors. The major idea behind this architecture is to achieve high CPU performance by raising the clock frequency. It is no secret today that the Pentium 4 core clock frequency looks really impressive against the background of other processors available in the market. This idea is developed further in the new Prescott CPUs. Intel made a few changes to the core that allowed another significant increase in the clock frequency potential. Besides the semiconductor technologies, “strained” silicon and special automated core design techniques, the changes have also affected the micro-architecture itself. Intel even mentioned that “Prescott is based on enhanced NetBurst architecture”.

No doubt the major key to higher working frequencies is a longer execution pipeline. Instruction execution is split into multiple simpler stages, each of which completes faster, so instructions can be fed into the pipeline at a higher rate. When Intel announced the Pentium 4 processor family, the execution pipeline became 20 stages long (it used to be 10 stages in Pentium III). We are still witnessing the effect of this change: while the maximum clock frequency of the Pentium III processor never exceeded 1.5GHz, today’s Pentium 4 CPUs can easily work at frequencies beyond 3GHz. Intel continued this successful trend and made the execution pipeline of the new Prescott processor even longer.

Intel hopes that its new Prescott based CPUs will be able to reach a 4.5GHz clock frequency. Of course, the pipeline has to be lengthened quite significantly to achieve this ultimate goal. Intel doesn’t disclose the real length of Prescott’s execution pipeline, although it claims that it has at least 30 stages now.

We also undertook some empirical calculations trying to figure out the length of Prescott’s pipeline. Our assumptions were based on the time it takes the CPU to refill the pipeline after a wrong prediction. Our estimates showed that Prescott’s pipeline should be around 35-36 stages long!

At the same time, we shouldn’t forget that there is a downside to “longer pipeline and higher clock frequency”. Firstly, the higher the CPU core clock frequency, the more tangible the core idling time when the cache holds no data for the CPU to work on. We all know that the memory subsystem of contemporary platforms is very slow compared with the processors’ computational units. Moreover, two ALUs out of three in CPUs based on NetBurst architecture work at double the core frequency. Therefore, the CPU wastes a lot of time waiting for new data to arrive, which causes catastrophic idling. Secondly, a longer pipeline causes a lot of trouble in case of wrong branch predictions. In this case the execution units stall and the CPU has to clear the entire pipeline and then refill it anew, which definitely takes more time as the pipeline gets longer.

These two problems of a long processor pipeline set two major tasks for Intel engineers. They had to do their best to eliminate the negative effect of the longer Prescott pipeline, so that the overall processor performance didn’t turn into a failure. This matters especially now, when higher working frequencies are simply out of reach because the production technology is still immature and needs to be polished first.

We will not discuss Intel NetBurst architecture today. If you are looking for more detailed materials on it, please check our article called Intel Pentium 4 1.4GHz Review. Part 1: Processor Architecture and Platform Overview. And now let’s find out what changes have been made to the new Prescott compared with the previous Northwood core.

Prescott’s Enhanced Architecture

With the launching of the new Prescott processor Intel made a significant step forward towards successful improvement of their NetBurst architecture. The picture below shows something like a NetBurst genealogical tree with highlighted improvements introduced in the new Prescott processor core.

Let’s discuss these improvements in a bit more detail now.

Improved Branch Predictor. Most processor delays are caused by the need to clear and refill Prescott’s long pipeline after incorrect branch predictions. Therefore, the best way to eliminate these delays is to avoid incorrect predictions altogether. Although the branch prediction algorithm of the NetBurst architecture was very efficient from the very beginning, Intel has managed to improve it even further.

The Branch Prediction unit in Intel processors with NetBurst architecture relies on the Branch Target Buffer (BTB), a 4KB buffer storing statistics about branches that have already been executed. In other words, Intel’s branch prediction is based on a probabilistic model: the CPU evaluates a given branch as likely or unlikely to be taken according to the collected statistical data. This algorithm has proved very efficient; however, it is useless when there are no statistics for a given branch. In this case the Northwood based CPUs assumed a “backward” branch to be taken, on the grounds that loop-closing branches are the most common kind.

This fallback for branches without statistics has been significantly improved in the new Prescott core. Now, if there are no statistics for a given branch, the branch prediction unit no longer assumes that every backward branch is taken. Since backward branches that close loops usually do not exceed a certain empirically determined distance, the unit bases its decision on the branch distance in each particular case.

Moreover, the dynamic branch prediction algorithm has also been slightly improved. The Prescott processor acquired an indirect branch predictor, which was first used in Pentium M processors and proved highly efficient there.

So, while Northwood based processors averaged 0.86 incorrect predictions per 100 instructions, the new Prescott brings this down to 0.75 per 100 instructions. In other words, we get roughly 13% fewer incorrect branch predictions, which means fewer delays caused by the need to empty and refill the execution pipeline.
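The interplay between collected statistics and the distance-based fallback can be sketched roughly as follows. This is an illustration only, not Intel’s design: the 2-bit saturating counter is a textbook technique swapped in for the undisclosed BTB logic, the forward-branch rule is our assumption, and `MAX_LOOP_DISTANCE` is a made-up threshold.

```python
# Hypothetical sketch: the class name, counter scheme and threshold are
# illustrative assumptions, not the actual Prescott implementation.
MAX_LOOP_DISTANCE = 512  # assumed cut-off in bytes; the real value is not given


class BranchPredictor:
    def __init__(self):
        # Stand-in for the Branch Target Buffer: branch address -> 2-bit counter.
        self.history = {}

    def predict(self, branch_addr, target_addr):
        """Return True if the branch is predicted taken."""
        counter = self.history.get(branch_addr)
        if counter is not None:
            # Dynamic prediction: counter values 2 and 3 mean "taken".
            return counter >= 2
        # Fallback when no statistics exist for this branch yet.
        distance = branch_addr - target_addr
        if distance < 0:
            return False  # forward branch: assumed not taken
        # Backward branch: a short jump is probably a loop, a long one is not.
        return distance <= MAX_LOOP_DISTANCE

    def update(self, branch_addr, taken):
        """Record the real outcome of a branch."""
        if branch_addr not in self.history:
            self.history[branch_addr] = 2 if taken else 1  # start in a weak state
        elif taken:
            self.history[branch_addr] = min(3, self.history[branch_addr] + 1)
        else:
            self.history[branch_addr] = max(0, self.history[branch_addr] - 1)


predictor = BranchPredictor()
print(predictor.predict(0x1040, 0x1000))  # short backward jump, no history
print(predictor.predict(0x9000, 0x1000))  # long backward jump, no history
```

Under these assumptions a Northwood-style fallback would predict both branches as taken, while the distance check treats only the short jump as a loop.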
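A quick calculation shows both the size of this improvement and why it matters for a longer pipeline. The sketch below uses the misprediction rates quoted above and a deliberately crude model that we assume for illustration: each misprediction costs about as many clocks as the pipeline has stages. The stage counts are the figures mentioned earlier (20 for Northwood, Intel’s “at least 30” and our estimate of about 35 for Prescott).

```python
# Misprediction rates per 100 instructions, as quoted in the text.
northwood_rate = 0.86
prescott_rate = 0.75

# Relative reduction in mispredictions.
reduction = (northwood_rate - prescott_rate) / northwood_rate
print(f"Fewer mispredictions: {reduction:.1%}")


def penalty_clocks_per_100(rate_per_100, pipeline_stages):
    """Crude model (assumption): one misprediction costs one pipeline refill,
    taken to be as many clocks as there are stages."""
    return rate_per_100 * pipeline_stages


northwood_penalty = penalty_clocks_per_100(northwood_rate, 20)
prescott_penalty_30 = penalty_clocks_per_100(prescott_rate, 30)
prescott_penalty_35 = penalty_clocks_per_100(prescott_rate, 35)

print(f"Northwood, 20 stages: {northwood_penalty:.2f} clocks per 100 instructions")
print(f"Prescott, 30 stages:  {prescott_penalty_30:.2f} clocks per 100 instructions")
print(f"Prescott, 35 stages:  {prescott_penalty_35:.2f} clocks per 100 instructions")
```

In this simplified model the better predictor softens the cost of the longer pipeline but does not cancel it, which is why Intel had to combine it with the other improvements described in this section.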

Faster Instructions Execution. The new processor core has the same number of integer ALUs: two integer ALUs working at double the core frequency for simple instructions and one more ALU for complex instructions. However, some instructions are processed much faster now. Prescott owes this performance increase to a few changes introduced in the ALUs.

First of all, I would like to mention that Intel added a shifter/rotator unit into one of the fast ALUs performing all instructions like shifts and rotations. As a result these instructions are now performed much faster, because in the previous Pentium 4 processors they were regarded as complex instructions and hence processed by the slow ALU.

The integer multiplication will also be performed faster by Prescott processors. In the previous versions of Intel’s NetBurst architecture integer multiplication was performed by the FPU, which required operands to be translated into floating-point format and then back to the integer format. In Prescott processor the integer multiplication is performed by the integer ALU, which definitely works considerably faster.

According to the measurements, shifts and rotations are now performed at least 4 times faster, while integer multiplication got 25% faster. However, we should keep in mind that the longer pipeline and the different L1 cache algorithms have affected the time required to process other simple instructions. Many instructions that used to require about half a clock cycle now take an entire clock, so it wouldn’t be correct to claim an overall ALU performance improvement.

Improved Data Pre-Fetcher. This improvement should help with the delays that occur when the cache holds no data for further processing, a very unpleasant situation in which the CPU sits idle waiting for the data to come from the memory to the cache. As we have already said, Prescott has twice as large L1 data and L2 caches. Besides that, Intel has also improved the data prefetch algorithms.

Intel improved not only the software data prefetching initiated by the running application, but also the hardware data prefetch mechanism. As for the former, the CPU now processes software prefetch instructions even if the information about the requested data is absent from the TLB. Moreover, these instructions can be cached in the Trace Cache. However, this does not help much in practice, because existing compilers do not insert software prefetch instructions into the code, so the hardware prefetcher improvement is much more important here. According to Intel, the new hardware prefetch algorithm of the Prescott processor tracks the data flow as well as the code flow and ensures a performance improvement of about 35%.

Besides the changes already mentioned, I would also like to point out that Prescott features more Write Combining buffers, which allows more data store and load operations to be in flight simultaneously.

The picture below is a flow-chart for Prescott processor:

In fact, this flow-chart shows that there are no structural changes in the new Prescott core compared with the previous solutions on NetBurst architecture. The most evident differences are the size and the structure of L1 and L2 caches, which we will discuss later today.

Cache and Memory Subsystem

Cache-memory seems to have undergone the most changes in the new Prescott core. The different cache sizes in Prescott and Northwood can be noticed with the naked eye: the L1 data cache grew from 8KB to 16KB, and the L2 cache grew from 512KB to 1MB. As for the structure of Prescott’s cache-memory, the L1 data cache features 8-way set associativity with a 64-byte line length and works according to the Write Through algorithm. In other words, the associativity of the L1 cache has doubled compared with the Northwood core. The L2 cache of Prescott is almost the same as that of Northwood: it is also an 8-way cache working according to the Write Back algorithm, with 128-byte lines. The L2 cache in the Prescott core features a 256bit bus, just like Northwood’s.
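The cache parameters above determine the number of sets in each cache: size divided by the product of associativity and line length. The short sketch below works this out; the helper name `cache_sets` is ours, and the 4-way figure for Northwood’s L1 is inferred from the statement that associativity doubled.

```python
KB = 1024


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("cache size must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


# L1 data caches: 64-byte lines.
prescott_l1d = cache_sets(16 * KB, ways=8, line_bytes=64)
northwood_l1d = cache_sets(8 * KB, ways=4, line_bytes=64)  # 4-way inferred from the text

# L2 caches: 8-way, 128-byte lines.
prescott_l2 = cache_sets(1024 * KB, ways=8, line_bytes=128)
northwood_l2 = cache_sets(512 * KB, ways=8, line_bytes=128)

print("L1 data sets:", prescott_l1d, "(Prescott) vs", northwood_l1d, "(Northwood)")
print("L2 sets:     ", prescott_l2, "(Prescott) vs", northwood_l2, "(Northwood)")
```

The L1 data cache thus doubled in size by doubling the number of ways with the same number of sets, while the L2 cache doubled by doubling the number of sets at unchanged associativity.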

Theoretically, the increase in the cache-memory size is another way to combat processor idling caused by the absence of data for further processing. That is why as the core clock frequencies increase and the gap between the CPU performance and memory speed grows bigger, efficient data cache-memory becomes more and more important. This way, the enhancement of L1 and L2 cache of the new Prescott processor core is a very significant change, especially since this core was initially developed for higher clock frequencies.

As for the L1 cache for instructions, it is known as Execution Trace Cache in NetBurst architecture, because it stores instruction sequences in the already decoded form. Its size and structure remained unchanged: it can store up to 12,000 micro-operations, which is equivalent to 8-16KB of ordinary instruction cache.

However, let’s check the actual speed of Prescott’s cache-memory, especially since there are new surprises waiting for us here. To measure the performance and latency of the cache and the memory we used the Cache Burst 32 utility. The test system where we performed all the measurements was based on an ASUS P4C800-E Deluxe mainboard on the i875P chipset and featured dual-channel DDR400 SDRAM with 2-3-2-6 timings. For our experiments we used a Pentium 4 processor on the Northwood core, a Pentium 4 processor on the Prescott core and a Pentium 4 Extreme Edition, all working at 3.2GHz. For a more illustrative comparison we also considered the results shown by Athlon 64 platforms. One of the competitor systems was built on an Athlon 64 FX-51 CPU working at 2.2GHz with dual-channel Registered DDR400 SDRAM with the timings set to 2-3-2-6, and the second featured an Athlon 64 3400+ working at 2.2GHz core clock frequency with DDR400 memory at the same timings. The other components of our testbeds do not affect the results of this test, so we will not mention them here.

First of all we measured the bandwidth of the memory subsystems of our platforms. You can see the memory performance comparison for systems based on Pentium 4 (Prescott), Pentium 4 (Northwood) and Pentium 4 Extreme Edition when we worked with data blocks of different sizes.

However, the graphs allow only a qualitative analysis. To draw more in-depth conclusions, we will turn to exact performance numbers, including those obtained on AMD platforms.

 

| CPU | L1 data cache: size | L1 data cache: bandwidth (reading), MB/sec | L1 data cache: bandwidth (writing), MB/sec | L2 cache: size | L2 cache: bandwidth (reading), MB/sec | L2 cache: bandwidth (writing), MB/sec | Memory: bandwidth (reading), MB/sec | Memory: bandwidth (writing), MB/sec |
|---|---|---|---|---|---|---|---|---|
| Prescott 3.2 | 16KB | 44492 | 10832 | 1024KB | 24793 | 10799 | 5006 | 1777 |
| Northwood 3.2 | 8KB | 45546 | 13891 | 512KB | 25618 | 13909 | 4297 | 1756 |
| Pentium 4 XE 3.2 | 8KB | 45526 | 13877 | 512KB | 25693 | 13891 | 4238 | 1919 |
| Athlon 64 FX-51 | 64KB | 29323 | 16638 | 1024KB | 10177 | 8438 | 3559 | 2418 |
| Athlon 64 3400+ | 64KB | 29359 | 16664 | 1024KB | 10323 | 8448 | 2907 | 1364 |

As we see, although Prescott boasts a larger L1 cache than Northwood, its bandwidth is somewhat lower, especially on writing. A similar thing happens to the L2 cache bandwidth. However, when we compare the cache read bandwidths of Pentium 4 and Athlon 64 processors, the Intel solutions are the indisputable leaders thanks to the wider bus between the L2 cache and the processor core.

Here I have to point out one very curious detail. Even though Prescott’s cache-memory reads and writes no faster than Northwood’s, the copy speed of the 90nm processor is considerably higher. This effect can be explained by the fact that data loading and storing in the Prescott core have been additionally improved, so that the CPU can start using recently stored data even before they have been written to the cache. This is possible thanks to a special Store Forwarding Buffer.

When we tested the processor’s work with the memory, we discovered another surprising fact. Prescott reads data from the memory faster than Northwood. The CPU owes this pretty tangible result to the improved Data Prefetcher, which we have already discussed above.
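To put these differences into numbers, the sketch below computes the relative change from Northwood to Prescott for every column of the bandwidth table. The data are copied from the table; the dictionary keys and the helper name are ours.

```python
# Measured bandwidth in MB/sec, copied from the table above.
bandwidth = {
    "Prescott 3.2": {
        "l1_read": 44492, "l1_write": 10832,
        "l2_read": 24793, "l2_write": 10799,
        "mem_read": 5006, "mem_write": 1777,
    },
    "Northwood 3.2": {
        "l1_read": 45546, "l1_write": 13891,
        "l2_read": 25618, "l2_write": 13909,
        "mem_read": 4297, "mem_write": 1756,
    },
}


def relative_change(new, old):
    """Change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


changes = {
    key: relative_change(bandwidth["Prescott 3.2"][key], bandwidth["Northwood 3.2"][key])
    for key in bandwidth["Prescott 3.2"]
}

for key, pct in changes.items():
    print(f"{key:9s} {pct:+6.1f}%")
```

Cache reads lose only a few percent and cache writes lose a little over 20%, while memory reads gain about 16%.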

Besides the bandwidth, we also care a lot about another parameter characterizing memory subsystem and caches: the latency.

Unfortunately, even a quick glance at this graph is more than enough to understand that the latency of L1 and L2 caches in the new Prescott processor has grown much higher compared with the previous Northwood core. And here are the numbers:

 

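| CPU | L1 data cache: size | L1 data cache: latency, clocks | L1 data cache: latency, ns | L2 cache: size | L2 cache: latency, clocks | L2 cache: latency, ns | Memory: latency, clocks | Memory: latency, ns |
|---|---|---|---|---|---|---|---|---|
| Prescott 3.2 | 16KB | 4 | 1.25 | 1024KB | 28 | 8.75 | 251 | 78.43 |
| Northwood 3.2 | 8KB | 2 | 0.625 | 512KB | 19 | 5.94 | 236 | 73.75 |
| Pentium 4 XE 3.2 | 8KB | 2 | 0.625 | 512KB | 19 | 5.94 | 240 | 75.00 |
| Athlon 64 FX-51 | 64KB | 3 | 1.36 | 1024KB | 13 | 5.91 | 113 | 51.36 |
| Athlon 64 3400+ | 64KB | 3 | 1.36 | 1024KB | 13 | 5.91 | 101 | 45.91 |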

Yes, unfortunately, we have to state that not only the size of Prescott’s cache-memory has grown, but also its latency. And it has grown a lot: for the L1 cache the latency doubled! As a result, Intel will no longer be able to boast the extremely low latency of its L1 data cache. Measured in nanoseconds, the latency of Pentium 4’s L1 data cache is now close to that of the Athlon 64 L1 cache, though the latter is four times larger. However, the increase in the L1 cache latency is another forced measure, taken so that the new Pentium 4 processors on the Prescott core can go beyond 4GHz core frequency.
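The nanosecond columns of the table follow directly from the clock counts: one clock at f GHz lasts 1/f ns. The sketch below reproduces the conversion for three of the tested CPUs, using the clock counts and core frequencies given above; the names are ours.

```python
measured = [
    # (CPU, core clock in GHz, L1 clocks, L2 clocks, memory clocks)
    ("Prescott 3.2", 3.2, 4, 28, 251),
    ("Northwood 3.2", 3.2, 2, 19, 236),
    ("Athlon 64 FX-51", 2.2, 3, 13, 113),
]


def clocks_to_ns(clocks, core_ghz):
    """Convert a latency in core clocks to nanoseconds."""
    return clocks / core_ghz


latency_ns = {
    name: tuple(clocks_to_ns(c, ghz) for c in (l1, l2, mem))
    for name, ghz, l1, l2, mem in measured
}

for name, (l1, l2, mem) in latency_ns.items():
    print(f"{name:16s} L1 {l1:5.2f} ns   L2 {l2:5.2f} ns   memory {mem:6.2f} ns")
```

Prescott’s 4 clocks at 3.2GHz come to 1.25 ns, which is close to the 1.36 ns that Athlon 64 needs for its 3 clocks at 2.2GHz.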

Similar changes were made to the L2 cache, too. In terms of L2 cache latency, the new Prescott processor yields to Northwood, as well as to the competing CPUs from AMD Athlon 64 family.

Although theoretically the memory latency of the Prescott processor had to remain unchanged, we see that it got worse in this case too.

As a result, we have to admit that hunting for the high clock frequency potential in its Prescott core, Intel sacrificed the latencies during the work with data. But on the other hand, we shouldn’t forget that Intel has also applied some new techniques aimed at improving the memory buses efficiency. And the higher bandwidth during data copying as well as higher data read speed from the memory are clear evidence of that.

Hyper-Threading Technology

We have known for a while that Hyper-Threading technology in the new processors on the Prescott core would undergo certain improvements. However, there has been a lot of speculation about what particular improvements these would be. Some people thought that Prescott would be recognized by the operating system as four logical CPUs; others expected Prescott to cope better with situations when one thread blocks the execution of another. However, neither of these actually took place.

A Pentium 4 processor on the Prescott core is recognized by the operating system as two logical CPUs. And as our practical experiments show, blocked threads can slow this CPU down even with Hyper-Threading technology enabled.

Let me tell you a bit more about our experiments. To test the Hyper-Threading technology of the Prescott processor in real conditions, we developed a small program that creates two threads in the system. The first thread simply adds integers and at the end raises a flag indicating that the task has been completed. The second thread is an empty spin-wait loop, which ends only when the first thread raises the end flag.

When we ran this pretty simple program on Northwood based processors, the results turned out dramatically poor. The empty loop occupies the execution units of the CPU, which immediately slows down the first, calculating thread.

Just take a look at the results. At first come the results for Pentium 4 (Northwood) working at 3.2GHz with enabled and disabled Hyper-Threading technology:

Northwood, HT enabled
   

Northwood, HT disabled

The first number produced by our test stands for the time it takes to complete the calculations when this thread is running alone. The second number stands for the time it takes to complete the calculations when a second thread is running in parallel to the first one (the empty spin-wait loop I have already mentioned). As we see, Hyper-Threading technology harms the performance a lot. The test takes 2.5 times longer to complete when Hyper-Threading is enabled: the empty loop occupies the processor execution units, hindering the processing of the tasks in the first thread. Even though our test is purely synthetic, situations like that sometimes happen in real multi-threaded applications. We have simply modeled a case where Hyper-Threading does harm the processor performance.

The third and the fourth numbers produced by our test indicate the time it takes to run both threads in case we apply different optimizations preventing the threads from blocking one another. In one case we used a PAUSE instruction, and in another – we used a special synchronization object of the Windows operation system.
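The structure of such a test can be sketched as follows. This is not our original test program: it is a minimal Python illustration with names of our own choosing. Python’s global interpreter lock prevents the two threads from truly running in parallel, so the sketch shows only the shape of the four measurements and will not reproduce the hardware effect. `time.sleep(0)` merely yields the thread and stands in loosely for the PAUSE instruction; `threading.Event.wait` stands in for an operating system synchronization object.

```python
import threading
import time

N = 200_000  # amount of work for the calculating thread (arbitrary)


def add_integers(done):
    """First thread: add integers, then raise the end flag."""
    total = 0
    for i in range(N):
        total += i
    done.set()


def spin_wait(done):
    """Second thread, naive variant: empty busy-wait loop."""
    while not done.is_set():
        pass


def yielding_wait(done):
    """Second thread, polite variant: yield on every iteration."""
    while not done.is_set():
        time.sleep(0)


def blocking_wait(done):
    """Second thread, blocking variant: sleep until signalled."""
    done.wait()


def timed_run(waiter=None):
    """Time the calculating thread, optionally alongside a waiting thread."""
    done = threading.Event()
    threads = [threading.Thread(target=add_integers, args=(done,))]
    if waiter is not None:
        threads.append(threading.Thread(target=waiter, args=(done,)))
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


results = {
    "alone": timed_run(),
    "spin-wait": timed_run(spin_wait),
    "yielding spin": timed_run(yielding_wait),
    "blocking wait": timed_run(blocking_wait),
}

for label, seconds in results.items():
    print(f"{label:14s} {seconds * 1000:8.2f} ms")
```

The four timings correspond to the four numbers our test reports; a native implementation would be needed to observe the actual contention between logical CPUs.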

Now let’s see how Prescott will cope with our tricky test:

Prescott, HT enabled
   

Prescott, HT disabled

As we see, the situation doesn’t actually change. Just like in the previous case, our test reveals all the bottlenecks of Hyper-Threading technology that we have already seen with Northwood.

However, I have to admit that some improvements to the technology have still been made in the Prescott core. Firstly, the SSE3 instruction set introduced in the new 90nm processor offers software developers some new opportunities for better thread synchronization (we will talk about it in a special section of today’s article). Secondly, Prescott has learned to run in parallel certain operations that Northwood could perform only one at a time. Without going deep into details, I would like to say that this concerns simultaneous work of several threads with the cache-memory.

SSE3 Instructions Set

Another innovation in the Prescott core is the new SIMD instruction set, which was first known as PNI (Prescott New Instructions) but then got a new marketing name: SSE3. In fact, I do not think it would be fair to call SSE3 a fully-fledged instruction set. It includes only 13 new instructions, which doesn’t look serious, especially against the background of the previous SIMD instruction sets from Intel, each of which offered dozens of new instructions. Moreover, SSE3 is not a new instruction set developed for some specific tasks. It is nothing but a few additional isolated instructions, which kind of “correct a few bugs” in the already existing sets.

SSE3 includes the following new instructions:

Although Intel released the SSE3 programming guidelines for software developers last summer, there are no programs yet that actually use the new instructions. However, we know for sure that they will arrive very soon. First of all, SSE3 will be used in various video codecs because, according to Intel, the LDDQU instruction can speed up video compression by 10% when used in data encoding algorithms. By the way, the new Intel C++ 8.0 compiler already supports SSE3, which means that other software employing SSE3 is on the way.
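To give a feel for what such isolated helper instructions do, here is a small Python emulation of four of them operating on 4-element lists instead of 128-bit registers. The instruction names (MOVSLDUP, MOVSHDUP, ADDSUBPS, HADDPS) and their lane semantics are taken from Intel's public SSE3 documentation as I understand it, not from this article; the helper `swap_pairs` and the function `complex_mul_packed` are hypothetical names introduced only for this illustration.

```python
def movsldup(v):
    """MOVSLDUP: duplicate the even-indexed element of each pair."""
    return [v[0], v[0], v[2], v[2]]

def movshdup(v):
    """MOVSHDUP: duplicate the odd-indexed element of each pair."""
    return [v[1], v[1], v[3], v[3]]

def addsubps(a, b):
    """ADDSUBPS: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]

def haddps(a, b):
    """HADDPS: sum adjacent pairs of a, then adjacent pairs of b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def mul(a, b):
    """Lane-wise multiply (MULPS, which already existed in SSE)."""
    return [x * y for x, y in zip(a, b)]

def swap_pairs(v):
    """Swap the two elements of each pair (a plain shuffle in real SSE code)."""
    return [v[1], v[0], v[3], v[2]]

def complex_mul_packed(x, y):
    """Multiply two complex numbers per 'register', laid out as [re0, im0, re1, im1]."""
    t1 = mul(movsldup(x), y)               # [a*c, a*d, ...]
    t2 = mul(movshdup(x), swap_pairs(y))   # [b*d, b*c, ...]
    return addsubps(t1, t2)                # [a*c - b*d, a*d + b*c, ...]
```

For example, `complex_mul_packed([1, 2, 3, 4], [5, 6, 7, 8])` multiplies (1+2i) by (5+6i) and (3+4i) by (7+8i) in one pass, which is the kind of pattern that previously needed extra shuffles and sign flips.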

Intel’s Roadmap

So, the major goal Intel’s engineers pursued when they worked on the new Prescott core was to develop a processor that scales better in clock frequency than its predecessors. Despite this, the maximum clock frequency of today’s Prescott CPUs is only 3.4GHz, although Intel claims that Prescott based CPUs will go beyond 4.5GHz next year.

According to the company’s roadmap, the Prescott core will be used in Pentium 4 CPUs for the next year or a little longer. During this time Intel will exploit the frequency potential of the newcomer. This is what Intel’s plans for introducing new processor models throughout the next year look like:

In Q2 2004 we should see Prescott based CPUs working at 3.6GHz, and in Q3 Intel is planning to announce the Pentium 4 3.8GHz. By the end of 2004 the Pentium 4 4GHz should appear. Next year the Prescott core will continue speeding up until Tejas replaces it. Tejas will be manufactured with the same process technology as Prescott (maybe slightly modified by then). However, it will also get a number of improvements that will raise the maximum clock frequency beyond 4.5GHz. Of course, Intel will again lengthen the execution pipeline and, as a result, will again have to reduce the negative influence of the longer pipeline on performance, namely by increasing the cache size, improving the branch prediction algorithms, etc.

Intel will also move the low-cost Celeron processor family to the new Prescott core. The first Celeron CPUs based on the 90nm Prescott will appear in the market in Q2 2004. The Prescott core will be slightly cut down for the low-cost segment: Celeron will have 256KB of L2 cache and a 533MHz bus, while Hyper-Threading technology will simply be disabled. Celeron processors will also work at lower clock frequencies than their Pentium 4 counterparts. For instance, the first Celeron processors on the Prescott core, due in Q2, will top out at 3.06GHz.

Speaking about Intel’s plans for the next year, I should say a few words about the compatibility of contemporary mainboards with the upcoming Prescott based processors. Unfortunately, the last Prescott based processor that will be fully compatible with today’s mainboards is the 3.4GHz model announced today. Although the Vcore of the new Prescott has been reduced to 1.25-1.4V, the CPU still draws a pretty high current, no lower than that required by a Northwood working at the same core clock. Contemporary mainboards are not designed to support such power-hungry CPUs and simply cannot deliver enough current for processors faster than 3.4GHz. To avoid any confusion about the compatibility of the new Prescott based processors with existing mainboards, Intel decided to design all CPUs working at 3.6GHz and above for a totally new socket form-factor known as LGA 775. Mainboards for LGA 775 processors will be released in Q2 together with the Pentium 4 3.6GHz, which will not be offered for the current Socket 478. Along with the new processor socket we will also see new chipsets coming out in the same time frame, but that is a different story.

To conclude our discussion of Intel’s upcoming plans, I can’t help mentioning the question of 64bit extensions to the IA32 architecture. Until quite recently, Intel denied any possibility of introducing 64bit extensions like AMD’s x86-64 in its IA32 processors. However, the company’s position has become much more flexible lately: its officials now say that 64bit extensions can be introduced as soon as software able to take advantage of them appears. Keeping in mind that the 64bit consumer version of the Windows XP operating system is scheduled for the middle of this year, we dare suppose that the new Prescott core already has these 64bit extensions implemented, but that they will remain deactivated until the right time comes. So I wouldn’t rule out that we will see Intel’s x86-64 processors in the market in the near future. Then again, I wouldn’t make any forecasts yet…

Thermal Conditions and Overclocking

The thermal conditions of the new Prescott processor core are a very “hot” topic. When Intel engineers developed the processor, they faced the problem of high leakage currents, which led to excessive power consumption and high heat dissipation. Although Prescott based processors are manufactured with a more advanced process technology and feature a lower Vcore, Intel still had to update its power requirements for mainboards and CPU voltage regulator circuitry (Prescott FMB 1.5). Moreover, CPUs working at frequencies above 3.4GHz are incompatible with today’s mainboards for exactly this reason: high power consumption.

At the same time Intel claims that Prescott CPUs with working frequencies of up to 3.4GHz will be compatible with all Socket 478 mainboards that support the top Pentium 4 (Northwood) processors today. The only thing the mainboard makers have to do is update the BIOS of their products so that it recognizes the CPU correctly. As for coolers for the new Prescott processors, there are no specific requirements.

Nevertheless, Prescott based CPUs dissipate more heat than Northwood based ones. For instance, take a look at the table below, which lists the TDP (Thermal Design Power) values for Prescott, Northwood and the Pentium 4 Extreme Edition:

 

                Pentium 4 (Prescott)    Pentium 4 (Northwood)    Pentium 4 Extreme Edition
Vcore           1.25-1.4V               1.475-1.55V              1.475-1.55V
TDP at 3.4GHz   103W                    ?                        ?
TDP at 3.2GHz   103W                    82W                      92.1W
TDP at 3.0GHz   89W                     81.9W                    -
TDP at 2.8GHz   89W                     69.7W                    -

The current Prescott revision is already the third one; the previous two were canceled precisely because of excessive heat dissipation. Even so, the mass-production Prescott based CPUs are still far too “hot”, even compared with the Pentium 4 Extreme Edition, which consists of more transistors. As a result, the thermal conditions inside Prescott based systems promise to become much more severe.
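To put the gap into numbers, the TDP values from the table above can be compared directly. The snippet below is only a convenience calculation over those published figures; the dictionary layout and the function name `tdp_increase_percent` are my own.

```python
# TDP values in watts, copied from the table above.
# None marks cells for which the table gives no number.
tdp = {
    3.4: {"Prescott": 103.0, "Northwood": None, "Extreme Edition": None},
    3.2: {"Prescott": 103.0, "Northwood": 82.0, "Extreme Edition": 92.1},
    3.0: {"Prescott": 89.0, "Northwood": 81.9, "Extreme Edition": None},
    2.8: {"Prescott": 89.0, "Northwood": 69.7, "Extreme Edition": None},
}

def tdp_increase_percent(freq_ghz, baseline="Northwood"):
    """Return how much higher Prescott's TDP is than the baseline, in percent."""
    row = tdp[freq_ghz]
    if row[baseline] is None:
        return None
    return (row["Prescott"] / row[baseline] - 1.0) * 100.0

for freq in sorted(tdp, reverse=True):
    delta = tdp_increase_percent(freq)
    if delta is not None:
        print(f"{freq:.1f}GHz: Prescott TDP is {delta:.1f}% above Northwood")
```

At 3.2GHz, for instance, Prescott's 103W is roughly 26% above Northwood's 82W and roughly 12% above the Extreme Edition's 92.1W.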

However, let’s take a look at the practical side of the matter. We measured the actual temperatures of the Pentium 4 Prescott, Pentium 4 Northwood and Pentium 4 Extreme Edition working at 3.2GHz. For our tests we used the same boxed cooler (the boxed versions of all three CPUs come with one and the same cooler). The temperatures were read from the built-in on-die thermal diode.

We measured the minimum CPU temperature in idle mode and the maximum CPU temperature in burn mode, when the CPU was warmed up with the help of special utilities:

 

                                    Idle    Burn
Pentium 4 (Prescott) 3.2GHz         45°C    61°C
Pentium 4 (Northwood) 3.2GHz        30°C    48°C
Pentium 4 Extreme Edition 3.2GHz    32°C    51°C

I don’t think I need to comment on these numbers. Prescott processors warm up much more under load than their predecessors. Note that we measured the CPU temperatures on an open testbed. I am scared to imagine what happens to Prescott when we close the system case…

As a result, there is no doubt that Prescott is the hottest x86 processor today. You should always keep this fact in mind when you purchase a system based on this CPU. Moreover, Intel has introduced new requirements for case manufacturers and system builders, the main idea of which is to ensure a low air temperature in the CPU area. For our part, we can only agree that you should pay special attention to thermal issues and proper cooling of your system and your CPU when working with a Prescott based system.

Besides the thermal tests, we also checked the overclocking potential of the new Prescott CPU. This experiment gives us some idea of the frequency potential of the C0 core stepping of the new Prescott processors. For our tests we took a Pentium 4 (Prescott) with a nominal frequency of 3.2GHz. We didn’t use any special cooling solutions, just a regular boxed cooler. To achieve better results we increased the processor Vcore to 1.475V.

During our overclocking experiments we managed to raise the FSB frequency from the nominal 200MHz to 225MHz, so that the CPU got overclocked to 3.6GHz.
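The arithmetic behind this result is simple: the core clock is the FSB base clock times the CPU multiplier. The sketch below assumes a fixed multiplier derived from the nominal figures (3.2GHz at 200MHz); the constant and function names are hypothetical.

```python
NOMINAL_FSB_MHZ = 200     # nominal FSB base clock from the text
NOMINAL_CORE_MHZ = 3200   # nominal core clock of the tested CPU

def core_clock_mhz(fsb_mhz, multiplier):
    """Core clock is the FSB base clock multiplied by the CPU multiplier."""
    return fsb_mhz * multiplier

# Assumption: the multiplier is fixed, so it follows from the nominal clocks.
multiplier = NOMINAL_CORE_MHZ // NOMINAL_FSB_MHZ

overclocked_mhz = core_clock_mhz(225, multiplier)
gain_percent = (overclocked_mhz / NOMINAL_CORE_MHZ - 1) * 100

print(f"x{multiplier} multiplier: {overclocked_mhz} MHz (+{gain_percent:.1f}%)")
```

With a 16x multiplier, raising the FSB from 200MHz to 225MHz gives 3600MHz, a 12.5% increase over the nominal frequency.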

I wouldn’t call this result impressive, especially taking into account that Intel claims a frequency potential of up to 4.5GHz. But note that this is only one of the very first core steppings. As the manufacturing technology improves, later core revisions designed for the LGA 775 Prescott versions should offer much higher overclocking potential.

Besides, I should also say that we could have achieved better results in our overclocking attempts with a more advanced cooling solution. When the CPU worked at 3.6GHz it warmed up to 68-70°C, while the maximum Tcase temperature for Prescott based CPUs is 73.5°C. So there is no doubt that overclocking was limited by the rapidly growing temperature of the processor core. Extreme overclocking fans who have water or cryogenic cooling systems at their disposal will be able to squeeze many more MHz out of their Prescott based CPUs.

A Bit of Performance Tests

To wind up our introductory article on the new Intel Prescott processor core and its features, we decided to run a few benchmarks that give some idea of the performance of CPUs based on the new core. We took the popular SiSoft Sandra 2004 test package because it contains a few simple algorithms that can use or bypass different functional units of the processor at the user’s request. Besides, these tests are so simple that they depend neither on the size and performance of the L2 cache nor on the efficiency of the memory subsystem. In other words, other system components have no influence on the CPU results in these tests.

We will test Pentium 4 (Prescott) and Pentium 4 (Northwood) working at 3.2GHz core frequency. The table below contains the results obtained in SiSoft Sandra 2004 measuring the processor performance when building the Mandelbrot set:

 

                Pentium 4 (Northwood)         Pentium 4 (Prescott)
                HT enabled    HT disabled     HT enabled    HT disabled
Integer SSE2    24520         19890           22637         18114
Float SSE2      35492         23542           30468         20693
Integer SSE     19923         17674           19480         16756
Float SSE       33280         24764           29260         20843
Integer MMX     14791         13158           14877         12481
Float FPU       6470          3281            5780          2966
Integer ALU     9198          7101            16033         11686
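The relative standing of the two cores is easier to see as ratios. The following snippet simply divides the Prescott score by the Northwood score for each row of the table above; the data layout and the function name `prescott_vs_northwood` are my own.

```python
# Scores copied from the table above: (HT enabled, HT disabled) for each core.
scores = {
    "Integer SSE2": {"Northwood": (24520, 19890), "Prescott": (22637, 18114)},
    "Float SSE2":   {"Northwood": (35492, 23542), "Prescott": (30468, 20693)},
    "Integer SSE":  {"Northwood": (19923, 17674), "Prescott": (19480, 16756)},
    "Float SSE":    {"Northwood": (33280, 24764), "Prescott": (29260, 20843)},
    "Integer MMX":  {"Northwood": (14791, 13158), "Prescott": (14877, 12481)},
    "Float FPU":    {"Northwood": (6470, 3281),   "Prescott": (5780, 2966)},
    "Integer ALU":  {"Northwood": (9198, 7101),   "Prescott": (16033, 11686)},
}

def prescott_vs_northwood(test, ht_enabled=True):
    """Return the Prescott score divided by the Northwood score for one test."""
    idx = 0 if ht_enabled else 1
    row = scores[test]
    return row["Prescott"][idx] / row["Northwood"][idx]

for test in scores:
    on = prescott_vs_northwood(test, ht_enabled=True)
    off = prescott_vs_northwood(test, ht_enabled=False)
    print(f"{test:13s} HT enabled: {on:.2f}  HT disabled: {off:.2f}")
```

A ratio below 1.0 means Prescott is slower than Northwood in that test; the Integer ALU row is the only one where the ratio is well above 1.0.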

We will not perform any in-depth analysis of these results, as they are mostly intended to give a general idea of the CPU performance. We will discuss the performance of the new Prescott based processor in a separate article, where you will see a whole bunch of tests. Here I would only like to say that the performance of Prescott’s functional units responsible for FP/MMX/SSE/SSE2 instructions didn’t get any better compared with Northwood. The lower results obtained for the FP/MMX/SSE/SSE2 units of the Prescott processor were caused primarily by the higher L1 cache latency. Note that Prescott’s performance with the ALUs involved is higher than that of the Northwood based processor in the same test. This victory can be explained by the architectural enhancements described above, namely faster multiplication, which is essential for Mandelbrot set calculations.
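To see why multiplication speed matters here, consider a minimal integer-only Mandelbrot kernel. This is not Sandra's code, whose internals are not described in this article; it is a sketch that assumes a 16.16 fixed-point number format, and all names in it are hypothetical.

```python
FRACTION_BITS = 16            # assumed 16.16 fixed-point format
ONE = 1 << FRACTION_BITS      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the assumed fixed-point representation."""
    return int(round(x * ONE))

def mandelbrot_iterations(cx, cy, max_iter=100):
    """Iterate z = z*z + c with integer math until |z|^2 exceeds 4."""
    zx = zy = 0
    four = 4 * ONE
    for i in range(max_iter):
        zx2 = (zx * zx) >> FRACTION_BITS   # integer multiply
        zy2 = (zy * zy) >> FRACTION_BITS   # integer multiply
        if zx2 + zy2 > four:
            return i                       # the point escaped
        zxy = (zx * zy) >> FRACTION_BITS   # integer multiply
        zx, zy = zx2 - zy2 + cx, 2 * zxy + cy
    return max_iter                        # treated as inside the set
```

Every iteration performs three integer multiplications and only a handful of additions and shifts, so a core that multiplies integers faster finishes such a loop sooner. Note that Python integers never overflow, whereas native code would need wide enough registers for the intermediate products.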

Conclusion

In this article we discussed the major features and characteristics of the new Pentium 4 core also known under the Prescott codename. Although this core is based on Intel’s NetBurst architecture, we still see a lot of changes compared with Prescott’s predecessor, Northwood. The detailed analysis of these changes and improvements showed that all of them were first of all aimed at increasing the clock frequency potential of the Pentium 4 processor family. According to the available information, Prescott should be able to grow as high as 4.5GHz.

Although the Prescott core introduces a lot of innovations that should theoretically increase the performance of the new CPU, such as a larger L1 data cache and L2 cache, processors on this core are very unlikely to make a significant breakthrough in performance compared with the previous processors on the NetBurst architecture. The thing is that Intel gave Prescott a 1.5 times longer execution pipeline in order to increase its clock frequency potential. That is why all the innovations that improve performance now serve to make up for the negative influence of the longer pipeline. Besides, the increased cache sizes caused a significant growth in the latencies of both caches, which can also slow the processor down in certain tasks.

Among the positive changes introduced in the new Prescott core, we should definitely mention the improved branch prediction unit, the improved data prefetcher and faster processing of some integer operations. Moreover, the 13 new instructions should also make programming easier and help optimize software developed for Prescott.

In conclusion I would only like to add that the greatest drawback of the new Prescott core for consumers, which has nothing to do with its architecture, is its significantly higher heat dissipation compared with the previous solutions. It will push mainboard makers to develop new platforms for this CPU and new cooling solutions for Prescott based processors faster than 3.4GHz.

This is where I would like to end our introduction to the Prescott architecture. In a little while we will continue our investigation of the new core’s peculiarities. For now, I would like to invite our readers to check out the next article, devoted to the actual performance of new and older solutions from Intel and AMD in a wide range of applications.