The Intel Pentium Pro, introduced in 1995, was the first processor built on the P6 architecture. A lot of time has passed since then. New CPU generations replaced the older ones, but the essence of the architecture remained unchanged. All the later processor families, such as Pentium II, Pentium III and Celeron, were based on the same core and differed only in core size, L2 cache implementation and the presence of SSE instructions, which were specific to Pentium III. Of course, this couldn't last forever, and the P6 architecture had to become obsolete one day. It is not the difficulty Intel faced in raising clock frequencies further, nor the competition with AMD, that matters most here. We wouldn't deny that Intel ran into problems as soon as Pentium III reached the 1GHz mark: as you may remember, Intel had to recall its Pentium III 1.13GHz because of its instability. However, this problem can be solved by moving manufacturing to 0.13 micron technology, especially since that is about to happen in the near future anyway.

The real reasons for Intel to introduce a new architecture lie deeper. Further increases in CPU clock frequency no longer provide significant performance growth. The problem is the higher latencies in P6 that show up when different CPU subunits are accessed. This was one of the reasons that pushed Intel towards Pentium 4 development. So, the recently announced Pentium 4 is a totally new processor, which has hardly anything in common with its predecessors. It is based on a new architecture called NetBurst. The name is intended to stress that the new CPU aims at speeding up data stream processing, which is directly connected with the rapidly developing Internet.

Intel NetBurst Architecture

First of all, let's look at the distinctive features of the new architecture. NetBurst offers a number of interesting innovations that give it huge speed potential and make it easy to raise the working frequencies of future Pentium 4 CPUs. Among the major technologies implemented in the NetBurst architecture, we would like to mention the following:

  • Hyper Pipelined Technology. Pentium 4 features an incredibly deep 20-stage pipeline.
  • Advanced Dynamic Execution. Improved branch prediction and out-of-order execution.
  • Trace Cache. Pentium 4 uses a special cache for decoded instructions.
  • Rapid Execute Engine. The Pentium 4 ALU runs at twice the CPU frequency.
  • SSE2. Enlarged set of instructions for data stream processing.
  • 400MHz System Bus. A new Quad Pumped system bus with 3.2GB/sec bandwidth.

And now a few words about each of these items, to give you a better idea of the new monster from Intel.

Hyper Pipelined Technology

Intel called the Pentium 4 pipeline "Hyper Pipelined Technology" because of its comparatively great depth: 20 stages! Just for your reference: the Pentium III pipeline is only 10 stages deep. What did Intel aim at with a pipeline that deep? Because the execution of each instruction is divided into smaller steps, each of which is simpler and faster to execute than the whole instruction, each stage needs less time and the clock frequency can be raised. While today's 0.18 micron technology allows Pentium III to reach only 1GHz (or 1.13GHz if you want to sound more optimistic), future Pentium 4 processors will be able to reach working frequencies of up to 2GHz.

However, a deeper pipeline has its drawbacks. The first one is evident: since there are more stages to pass before an operation is completed, the time required for each individual operation, measured in clock cycles, increases. That's why, in order to make sure that the entry-level Pentium 4 models prove faster than the top Pentium III CPUs, Intel starts its new processor family at 1.4GHz. If Intel launched a 1GHz Pentium 4, it would undoubtedly be beaten by a 1GHz Pentium III.

The second drawback of a deeper pipeline comes to light when a branch prediction error occurs. Like any other modern CPU, Pentium 4 can execute instructions in parallel and out of program order as well as in sequence. To keep the pipeline full, it has to guess which way each branch will go before the branch is actually resolved, and it bases this guess on the statistics collected for past branches. However, if the processor mis-predicts a branch, all the speculatively executed instructions must be flushed from the pipeline in order to restart execution down the correct program branch. On more deeply pipelined designs, more instructions must be flushed, resulting in a longer recovery time from a branch mis-predict. The net result is that applications with many hard-to-predict branches will tend to have a lower average number of instructions per clock.
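To put rough numbers on this, here is a minimal sketch of the usual textbook stall model. It is not Intel's data: the base CPI, the share of branches, the mis-prediction rate and the assumption that a flush costs as many clocks as the pipeline has stages are all illustrative guesses, and effective_cpi is a hypothetical helper name.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average clocks per instruction once mis-prediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Illustrative assumptions only, not measured values.
BASE_CPI = 1.0          # one instruction per clock when nothing stalls
BRANCH_FRACTION = 0.2   # one instruction in five is a branch
MISPREDICT_RATE = 0.1   # one branch in ten is mis-predicted

# Assumption: the flush penalty equals the pipeline depth in stages.
shallow_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
deep_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)

print("10-stage pipeline: %.2f instructions per clock" % (1.0 / shallow_cpi))
print("20-stage pipeline: %.2f instructions per clock" % (1.0 / deep_cpi))
```

Under these assumptions the 10-stage design averages about 0.83 instructions per clock and the 20-stage design about 0.71, so the deeper pipeline has to win the difference back through clock frequency and better prediction.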

Advanced Dynamic Execution

Intel engineers worked really hard to provide Pentium 4 architecture with a great number of features aimed at minimizing the branch mis-prediction penalty and at increasing the percentage of correct predictions. All these enhancements were implemented in an Advanced Dynamic Execution engine. Intel provided a very large window of instructions for out-of-order execution and enhanced the branch prediction capability that allowed the Pentium 4 processor to be more accurate in predicting program branches. It was done mostly by implementing a larger branch target buffer, as well as by implementing a more advanced branch prediction algorithm.

So, the window used for choosing the next instruction to execute holds up to 126 instructions, vs. the much smaller window of 42 instructions in the Pentium III architecture. The branch target buffer, which stores the history of past branches, was increased to 4K entries, while the Pentium III buffer held only 512.

This, together with the modified prediction algorithm, reduces the number of branch mis-predictions by about 33% compared with Pentium III. This is a really good result, because it means that Pentium 4 predicts roughly 90-95% of branches correctly.
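The link between a 33% cut in mis-predictions and a 90-95% hit rate can be checked with a few lines of arithmetic. This is only a sketch: the Pentium III baseline accuracies below are hypothetical, since no baseline is quoted here, and accuracy_after_reduction is an illustrative helper name.

```python
def accuracy_after_reduction(baseline_accuracy, reduction=0.33):
    """Prediction accuracy after cutting the mis-prediction count by `reduction`."""
    baseline_mispredicts = 1.0 - baseline_accuracy
    return 1.0 - baseline_mispredicts * (1.0 - reduction)


# Hypothetical Pentium III accuracies; the real figure is not given in the text.
for baseline in (0.85, 0.90, 0.925):
    improved = accuracy_after_reduction(baseline)
    print("baseline %.1f%% -> improved %.2f%%" % (baseline * 100, improved * 100))
```

A baseline somewhere between 85% and 92.5% lands the improved predictor at roughly 90% to 95%, which matches the range quoted above.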

Trace Cache

Pentium 4 doesn't have a regular L1 cache like the one in Pentium III, which was divided into two parts: one for data and one for instructions. The approach is totally different. Instructions are no longer stored in the L1 cache: it is intended solely for data. To cache instructions Pentium 4 uses the Trace Cache, which has a lot of advantages over a regular L1 cache: it keeps all the high-frequency execution units (integer and floating point) busy and prevents them from sitting idle after branch mis-predicts.

The most important thing about the Trace Cache is that it stores decoded instructions. In other words, these are not regular x86 instructions but the so-called micro-operations that the processor core actually works with. Storing the micro-ops in the Trace Cache avoids repeated decoding of x86 instructions when a program segment is executed again, for example after a branch mis-predict.

The second positive thing about the Trace Cache is that it retains the execution order of the micro-ops it caches. Even though that order is determined by branch prediction results, this is still a very reliable method because, as we have already said, the branch prediction accuracy of Pentium 4 is quite high and mis-predicts are not that threatening.

Unfortunately, Intel didn't disclose the size of its Trace Cache in bytes. However, it is known to hold about 12,000 micro-ops.

Rapid Execute Engine

The simplest part of today's CPU is the ALU (Arithmetic Logic Unit). Since it is a relatively simple unit, Intel managed to make it run at twice the frequency of the processor core. So, the ALU of a 1.4GHz Pentium 4 works at 2.8GHz.

The ALUs execute simple integer instructions, so the new CPU should do very well in integer operations. However, doubling the ALU frequency has no effect on Pentium 4 performance in floating-point, SSE or MMX operations.

So, ALU latency gets significantly lower. In particular, a 1.4GHz Pentium 4 spends only about 0.36ns (one 2.8GHz ALU clock) on an "add"-like instruction. Compare it with the 1ns a 1GHz Pentium III requires for the same instruction.
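The latency figures follow directly from the clock periods. The sketch below assumes that a simple add completes in exactly one ALU clock on Pentium 4 and one core clock on Pentium III; clock_period_ns is a hypothetical helper.

```python
def clock_period_ns(frequency_ghz):
    """Length of one clock cycle in nanoseconds."""
    return 1.0 / frequency_ghz


CORE_GHZ = 1.4
ALU_MULTIPLIER = 2  # the ALU is clocked at twice the core frequency

# Assumption: a simple add takes one ALU clock (Pentium 4) or one core clock (Pentium III).
p4_add_ns = clock_period_ns(CORE_GHZ * ALU_MULTIPLIER)
p3_add_ns = clock_period_ns(1.0)

print("Pentium 4 1.4GHz add: %.3f ns" % p4_add_ns)
print("Pentium III 1GHz add: %.3f ns" % p3_add_ns)
```

The Pentium 4 result is 1/2.8 ns, or about 0.357ns, in line with the figure quoted above.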

SSE2

When AMD implemented a new pipelined FPU in its Athlon processor, Intel Pentium III proved unable to beat Athlon in floating-point operations and fell significantly behind. However, with Pentium 4 Intel decided not to focus on simply enhancing the FPU, and extended the SSE instruction set instead. As a result, Pentium 4 features the SSE2 instruction set, which includes the 70 older SSE instructions and 144 new ones. This solution also follows from the NetBurst philosophy, which concentrates on speeding up data stream processing.

SSE instructions operated on eight 128-bit registers, XMM0 to XMM7, each storing four single-precision real numbers. Every SSE instruction was executed on all four numbers at once, which provided a tangible performance gain in specially optimized applications carrying out a lot of similar calculations (in fact, 3D games may also belong here, since gaming apps do more than just process data streams).

SSE2 uses the same registers and is backward compatible with the SSE of Pentium III. The instruction set grew so much because operations on the 128-bit registers can now be performed not only on sets of four single-precision real numbers, but also on pairs of double-precision real numbers, on 16 single-byte integers, on 8 two-byte short integers, on 4 four-byte integers, on 2 eight-byte long integers, or on one 16-byte integer. In other words, SSE2, today's symbiosis of MMX and SSE, can operate on all sorts of data that fit into 128-bit registers.
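To make the idea of one register holding different packed data types more tangible, here is a small emulation in plain Python. It is only a sketch: it uses the standard struct module to reinterpret 16 bytes, the names lanes_per_register and packed_add are hypothetical, and it runs no real SSE2 instructions. The single 16-byte integer case is left out because struct has no 128-bit format.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit XMM register

# Element type -> struct format character (standard sizes, little-endian).
ELEMENT_FORMATS = {
    "single-precision float": "f",
    "double-precision float": "d",
    "1-byte integer": "b",
    "2-byte integer": "h",
    "4-byte integer": "i",
    "8-byte integer": "q",
}


def lanes_per_register(fmt_char):
    """How many elements of this type fit into one 128-bit register."""
    return REGISTER_BYTES // struct.calcsize("<" + fmt_char)


def packed_add(a, b, fmt_char):
    """Emulate a packed add: unpack two 16-byte values, add lane by lane, repack.

    Integer overflow is not wrapped the way hardware would do it;
    struct.pack raises an error instead.
    """
    fmt = "<%d%s" % (lanes_per_register(fmt_char), fmt_char)
    xs = struct.unpack(fmt, a)
    ys = struct.unpack(fmt, b)
    return struct.pack(fmt, *(x + y for x, y in zip(xs, ys)))


for name, fmt_char in ELEMENT_FORMATS.items():
    print("%-24s %2d per register" % (name, lanes_per_register(fmt_char)))

# Two double-precision additions carried out by one "instruction".
reg_a = struct.pack("<2d", 1.5, 2.5)
reg_b = struct.pack("<2d", 0.5, 0.25)
result = struct.unpack("<2d", packed_add(reg_a, reg_b, "d"))
print(result)
```

The lane counts printed by the loop are the same 4, 2, 16, 8, 4 and 2 listed in the text.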

So, SSE2 is much more flexible and can deliver a considerable performance gain. However, applications have to be optimized specifically for SSE2, which is why it will hardly find wide use right after the new CPU is launched. Still, SSE2 definitely has a very promising future: even AMD is planning to implement it in its upcoming Hammer processor family.

The older applications, which make no use of SSE2 and rely entirely on the regular arithmetic coprocessor, will hardly run any faster even on a Pentium 4 system. Moreover, though Intel claims that the Pentium 4 FPU has also been slightly enhanced, the new CPU requires on average 2 clocks more for regular floating-point operations than its Pentium III predecessor.

L1 Cache

As far as the Level 1 cache in Pentium 4 is concerned, it is used only for data, because all the instructions are now stored in the Trace Cache. However, the Pentium 4 processor based on the Willamette core has only 8KB of L1 cache, which is quite small even compared to the 16KB data portion of the Pentium III L1 cache. Intel had no choice and was simply forced to make it so small because of the relatively large Pentium 4 die. Nevertheless, the Pentium 4 architecture can also support a larger L1 cache, so there is some hope that it will grow as soon as Intel shifts to 0.13 micron manufacturing technology and the Northwood processor core.

Anyway, Intel did its best to make up for the small L1 cache size. It used a new algorithm for accessing the L1 cache, and as a result its latency dropped to 2 processor clocks vs. 3 clocks in Pentium III. Taking into account that Pentium 4 also works at a higher frequency, its L1 cache needs around 1.4ns to respond (for the 1.4GHz model) compared to 3ns for a 1GHz Pentium III.
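The conversion from clocks to nanoseconds is simply the latency in cycles divided by the core frequency. A minimal sketch, with latency_ns as a hypothetical helper name:

```python
def latency_ns(cycles, core_ghz):
    """Convert a latency given in core clocks into nanoseconds."""
    return cycles / core_ghz


p4_l1_ns = latency_ns(2, 1.4)  # 2 clocks at 1.4GHz
p3_l1_ns = latency_ns(3, 1.0)  # 3 clocks at 1GHz

print("Pentium 4 1.4GHz L1 latency: %.2f ns" % p4_l1_ns)
print("Pentium III 1GHz L1 latency: %.2f ns" % p3_l1_ns)
print("Pentium III / Pentium 4 ratio: %.1f" % (p3_l1_ns / p4_l1_ns))
```

With these inputs the Pentium 4 L1 cache answers in a little under half the time of the Pentium III one.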

Like the L1 cache in Pentium III, the cache in the new Pentium 4 is write-through and 4-way set associative, and it features a 64-byte cache line.
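These three numbers (8KB, 4 ways, 64-byte lines) fix the cache geometry. The sketch below derives the number of sets and shows how an address would split into tag, set index and line offset. It assumes the plain textbook scheme with power-of-two sizes and indexing by consecutive address bits. How Intel actually indexes the cache is not described here, and the function names and the sample address are made up for illustration.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(address, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# Pentium 4 L1 data cache as described above: 8KB, 4-way, 64-byte lines.
SETS, OFFSET_BITS, INDEX_BITS = cache_geometry(8 * 1024, 4, 64)
print("sets:", SETS, "offset bits:", OFFSET_BITS, "index bits:", INDEX_BITS)

# Arbitrary sample address, chosen only for the demonstration.
tag, index, offset = split_address(0x12345, OFFSET_BITS, INDEX_BITS)
print("tag:", tag, "set:", index, "offset:", offset)
```

The result is 32 sets of four lines each, so any given address can live in only four places in the cache.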

L2 Advanced Transfer Cache

The Level 2 Advanced Transfer Cache is 256KB in size. It has a wide 256-bit bus and hence a higher cache bus bandwidth than AMD Athlon processors, which use a 64-bit cache bus. However, unlike the Athlon L2 cache, that of Pentium 4 (as well as that of Pentium III, actually) is not exclusive, i.e. it always holds a copy of the L1 cache contents.

Since Pentium 4 is intended for processing data streams in the first place, L2 cache speed is one of the key issues for it. That's why Intel doubled the throughput of the data path between the Level 2 cache and the processor core. This was possible only due to the fact that data is transferred from the Pentium 4 L2 cache on every core clock, while in Pentium III data is transferred only on every second core clock. As a result, a 1.4GHz Pentium 4 can deliver a data transfer rate of 44.8GB/s (32 bytes x 1 transfer per clock x 1.4GHz = 44.8GB/s). This compares to a transfer rate of 16GB/s for a 1GHz Pentium III.
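The same calculation can be written out for both processors. In this sketch the 256-bit bus width for Pentium III is inferred from the 16GB/s figure and the transfer on every second clock, and l2_bandwidth_gb_s is a hypothetical helper name.

```python
def l2_bandwidth_gb_s(bus_width_bits, transfers_per_clock, core_ghz):
    """Peak L2-to-core bandwidth in GB/s (1GB taken as 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_clock * core_ghz


# Pentium 4 1.4GHz: 256-bit bus, one transfer on every core clock.
p4_l2 = l2_bandwidth_gb_s(256, 1.0, 1.4)

# Pentium III 1GHz: a transfer on every second clock; 256-bit width is inferred.
p3_l2 = l2_bandwidth_gb_s(256, 0.5, 1.0)

print("Pentium 4 1.4GHz: %.1f GB/s" % p4_l2)
print("Pentium III 1GHz: %.1f GB/s" % p3_l2)
```

Note that these are peak figures: they say how fast data can move when the cache hits, not how often it does.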

Like Pentium III, Pentium 4 features an 8-way set associative L2 cache. Besides, it has a 128-byte cache line, while Pentium III has only a 32-byte cache line. The newcomer can also fetch a line not only as a whole but also as two 64-byte halves.

Speaking about the Pentium 4 caching system, we can't leave out the fact that the NetBurst architecture also supports an L3 cache of up to 4MB. However, we will not see an L3 cache in Pentium 4: it is reserved for future server processors.

CPU

Well, having taken a brief tour of the major parts of the NetBurst architecture, which is the main trump card of the Pentium 4 processor, let's take a closer look at its formal specs list:

  • The chip is manufactured with 0.18 micron technology and aluminum interconnects. Intel is going to shift to copper interconnects and 0.13 micron technology simultaneously.
  • The Willamette core is based on the NetBurst architecture. It contains 42 million transistors on a 217sq.mm die, which is more than twice as large as the AMD Athlon or Pentium III die.
  • It works only in special mainboards with the 423-pin Socket423.
  • Features a high-performance 400MHz Quad Pumped system bus.
  • 8KB L1 data cache. Trace Cache for decoded instructions holds up to 12K micro-ops.
  • Integrated 256KB Advanced Transfer L2 Cache working at the full core frequency and provided with a 256-bit bus.
  • Vcore: 1.7V.
  • SSE2 SIMD instructions.
  • The currently available versions run at 1.4GHz and 1.5GHz. Later Intel will also introduce a 1.3GHz version.

Intel Pentium 4 will be manufactured in an FC-PGA package; however, the die will be covered by a special heat spreader lid protecting it against external damage. Pentium 4 will fit into the 423-pin Socket423, which differs slightly in size from the regular Socket A and Socket370.

Since the die is considerably large, heat dissipation will be quite high. In particular, a 1.4GHz Pentium 4 working at 1.7V and consuming about 32A will dissipate about 52W of heat (the 1.5GHz version dissipates about 55W). That's why Pentium 4 coolers have to be quite large, too, with a bigger cooling surface.

If purchased in 1000-piece quantities, Pentium 4 is expected to sell at $819 for a 1.5GHz model and $644 for a 1.4GHz model. Pentium 4 1.3GHz, which is due on 29 January 2001, will sell at $409. Intel is planning to pursue a pretty aggressive pricing policy and to reduce the processor cost greatly in order to make it affordable for mainstream desktops.

Model              20 November   10 December   29 January
Pentium 4 1.5GHz   $819          $819          $644
Pentium 4 1.4GHz   $644          $574          $440
Pentium 4 1.3GHz   -             -             $409

Chipset and System Bus

Since Pentium 4 is based on a totally new architecture, it requires a new chipset to support it. And taking into account that Intel positions its CPU as the best solution for applications processing data streams, the chipset should provide high throughput rates for the major buses, such as the system bus between the CPU and the chipset North Bridge, and the memory bus.

First of all we have to point out that Pentium 4 uses an absolutely new Quad Pumped processor bus working at 400MHz. Its bandwidth is 3.2GB/sec, three times that of the Pentium III processor bus. This helps to reduce the time the CPU idles while waiting for new data to arrive. Physically, this high-speed bus is implemented by multiplying the base frequency (100MHz for Pentium 4) by 4 in the processor bus controllers of the chipset and the CPU. It means that the 400MHz frequency is present only on the link between the CPU and the chipset.

Since the system bus is so fast, the memory subsystem should also provide at least 3.2GB/sec of bandwidth, otherwise the system won't be well-balanced. So, when working on a chipset for its upcoming offspring, Intel decided to adapt the i840 chipset, which supports two Direct RDRAM channels. PC800 RDRAM provides around 1.6GB/sec per channel, so with two channels the memory bus bandwidth equals the required 3.2GB/sec.
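The balance between the processor bus and the memory bus can be verified with one formula: width times clock times transfers per clock. The sketch below brings in a few figures that are not stated in the text and should be read as assumptions: a 64-bit data bus for both processors, a 133MHz Pentium III bus, and a 16-bit PC800 RDRAM channel clocked at 400MHz with two transfers per clock. The helper name bandwidth_gb_s is hypothetical.

```python
def bandwidth_gb_s(width_bytes, clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s (1GB taken as 10**9 bytes)."""
    return width_bytes * clock_mhz * transfers_per_clock / 1000.0


# Assumption: both processor buses are 64 bits (8 bytes) wide.
p4_bus = bandwidth_gb_s(8, 100, 4)  # 100MHz base clock, quad pumped
p3_bus = bandwidth_gb_s(8, 133, 1)  # assumed 133MHz Pentium III bus

# Assumption: a PC800 RDRAM channel is 16 bits wide, 400MHz, 2 transfers per clock.
rdram_channel = bandwidth_gb_s(2, 400, 2)
dual_rdram = 2 * rdram_channel

print("Pentium 4 bus:   %.2f GB/s" % p4_bus)
print("Pentium III bus: %.2f GB/s" % p3_bus)
print("Dual RDRAM:      %.2f GB/s" % dual_rdram)
print("Memory keeps up with the bus:", dual_rdram >= p4_bus)
```

Under these assumptions the Pentium 4 bus carries about three times as much as the Pentium III bus, and two RDRAM channels match it exactly.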

Frankly speaking, the drawbacks of the notorious RDRAM have been the talk of the town for ages. Certainly, the No. 1 disadvantage is the extremely high price of this memory type. However, from the technological point of view, the use of dual-channel RDRAM in Pentium 4 systems seems quite justified. Unfortunately, the alternative, DDR SDRAM with similar memory bandwidth, will appear only by the end of next year. But we have to stress that RDRAM looks good only for data stream processing. If the application requires non-sequential data access, RDRAM latency is so high that DDR or even SDR SDRAM turns out to be a much better alternative, though Pentium 4 chipsets supporting SDRAM are likely to turn up in mid 2001 at the earliest.

So, here is a brief description of i850 (Tehama), the Pentium 4 chipset from Intel:

For the South Bridge Intel selected the ICH2 chip, which we have already discussed in detail in our i815E Chipset Review. The Intel 82850 MCH serves as the North Bridge. It supports the 400MHz system bus, AGP 4x and two Rambus channels, each working with a pair of RIMM modules.

Mainboards

In fact, the i850 chipset is quite expensive ($75). Besides, Pentium 4 mainboards require a 6-layer PCB, which makes their manufacturing pretty complicated and expensive, too. That's why only a limited number of mainboard manufacturers have expressed their intention to introduce products for Socket423: only 8 of them will offer Pentium 4 mainboards in the near future. Here is a brief description of some already announced boards:

Mainboard         Chipset  Form-Factor  RIMM  AGP      PCI  CNR  Notes
AOpen AX4T        i850     ATX          4     AGP Pro  5    1
ASUS P4T          i850     ATX          4     AGP Pro  5    0    CPU overclocking features
Gigabyte GA-8TX   i850     ATX          4     AGP Pro  5    1    Integrated Creative CT5880 sound
Intel D850GB      i850     ATX          4     AGP Pro  5    1
MSI MS-6339       i850     ATX          4     1        5    1

As you can see from the table above, all Pentium 4 mainboards are very much alike: they all feature 4 RIMM slots and 5 PCI slots. Most will also be equipped with an AGP Pro slot for high-end professional graphics cards with higher power consumption.

ATX 2.03

Besides a new mainboard and a new cooler, Pentium 4 will also require a new PC case complying with the ATX 2.03 specification. There are two major reasons for this third "must":

Firstly, all Pentium 4 coolers have very big heatsinks weighing up to 450g, so you can no longer fasten them to the processor socket. These coolers use a special retention mechanism, which is bolted directly to the PC case with four large bolts. It means that your case must have the corresponding mounting holes.

An indisputable advantage of the cooler retention mechanism is that it reduces the EMI from the CPU affecting other mainboard components, which matters all the more as the CPU works at higher frequencies.

Secondly, ATX 2.03 requires an additional four-pin cable, carrying 12V and 5V, that connects the mainboard to the power supply unit.

As we have already mentioned, Pentium 4 consumes much more power than any other processor available today, which is why it needs this extra power connection.

Continued: Processor Performance Analysis >>

