Prescott: The Last of the Mohicans? (Pentium 4: from Willamette to Prescott)

We are proud to offer you the most detailed article on the NetBurst architecture and its peculiarities. We applied our own approaches and methodology to studying the NetBurst architecture and are ready to share some facts and details no one has ever discussed before. This is the first article of the trilogy: more exciting stuff is coming within the next couple of days with absolutely exclusive material about what actually slows down Pentium 4 processors. You will not believe your own eyes when you see this! So, stay tuned and don’t miss the three exciting parts of X-bit’s in-depth NetBurst architecture investigation!

by Victor Kartunov
05/25/2005 | 11:45 AM

Chapter I: What This Article Is About and Why We Decided to Write It

It has been known for a long time that there are three things a man can watch forever: running water, burning fire and other people working :) While we don’t dare deny the first two statements, for readers of IT articles the third “universal truth” can be rephrased as follows: we can endlessly argue whether Pentium 4 is a “good” or a “bad” CPU.

 

It is true: there are very few processor architectures in the world that have caused as much discussion as the NetBurst micro-architecture, represented by the Intel Pentium 4 processors. These are the CPUs everybody has heard of today. The arguments about the advantages and drawbacks of this micro-architecture are still in full swing nowadays for multiple reasons: the architecture was revealed at a difficult moment for Intel, for the first time the performance of mass processors depended heavily on optimizations for each specific architecture, and for the first time the end-users became direct participants in the “megahertz race”. Later on this race resulted in the decision to completely give up frequency as a universal and adequate performance index.

All these factors have definitely affected the arguments. We, in our turn, also got carried away by the flow of these discussions: each of us has written at least one mile-long post in the forums trying to prove his point. But the great variety of diverse opinions clearly indicated that not everything is smooth and clear in the description of the new micro-architecture. So, all of us decided to put an end to this endless argument and find out what the Intel Pentium 4 processor actually is.

Now let me introduce the team of people hidden behind this mysterious “we”. Maybe some of you will recall the articles devoted to AMD Athlon 64 (Opteron) processor micro-architecture, which were written by the same team of people we would like to once again introduce to you today. Please meet (in alphabetical order):

This way, the entire discussion promised to be (and finally turned out to be) really exciting. And the results of this discussion have been summed up for you by the “dedicated typist” Victor Kartunov aka matik. This is the easiest way of all to make him do something for the good of humankind :)

I would only like to add that while we were working on the materials for this article we were once again sincerely amazed at the sophisticated imagination of the engineers and developers. So, welcome to the Intel Pentium 4!

Chapter II: Goals and Prospects of Processor Micro-Architecture Development

When we were working on the K8 micro-architecture coverage we arrived at the conclusion that the best way would be not to repeat what the developers say about their micro-architecture, but to explain its peculiarities and features in a simple and easy to understand way. Moreover, it looked really appealing to us to focus on the unclear statements in the whitepapers as well as to find and bring up the matters totally omitted there. Of course, you may find some things overlapping with the whitepapers you might have already seen, because any detailed micro-architecture study inevitably has to be based on the official documents.

I would also like to stress that we decided on purpose to publish this article now, when many details about the Pentium 4 micro-architecture are no longer that new to any of us. In fact, these details are so not new that almost every IT editor is ready to explain to you the “disadvantages of a longer pipeline for branch prediction” without even interrupting his lunch. You probably wonder why we delayed this article for so long? Let me explain. Firstly, a few things about the functioning of the Pentium 4 processor turned out to be far from simple. Secondly, despite the same marketing “Pentium 4” name, the Northwood and Prescott cores are very different from one another in the way they behave. As a result, we very often had to investigate the Prescott core and the Northwood core separately. Thirdly, we wanted to offer you a thorough and well-constructed analysis of the micro-architecture details, so that those of you who are really into it could take their time and enjoy our discussion at a completely different level of technical detail.

Moreover, when the Tejas project was cancelled, it became evident that Prescott was the last serious modification of the Pentium 4 core. Of course, Intel continued increasing the system bus frequency and the size of the L2 cache memory and enabled EM64T support, but all these are nothing but external changes: the core will not undergo any further modifications. That is why it makes perfect sense to sum many things up right now and finally answer the question: what do we know about this micro-architecture? Can we explain the Pentium 4 functioning peculiarities with something more essential than the boring phrases about “a longer pipeline”? Can we finally answer simple questions like “what is the actual length of the Pentium 4 pipeline?”

Well, let’s see what our investigations led to in the long run. And since the search for an answer has very often been more interesting than the actual answer, we would like to tell you a few exciting stories about our search for the answers. Especially since we will have to introduce quite a few specific terms and notions for better understanding of the topic.

Before we move on to the actual micro-architecture discussion let me remind you of a few interesting things. Very important things.

It is no secret that the performance of any micro-processor architecture can be represented as the processor frequency multiplied by the number of operations performed per clock cycle. So, the bigger each of the two factors, the bigger the product. Ideally it would be best to increase the working frequency and the number of operations performed per clock cycle simultaneously. However, in real life the situation is much more complicated, even though frequency growth does not directly contradict the growth of per-clock performance. To execute more operations per clock we need to introduce special design changes, perform complex analysis of the interdependencies between commands and work around algorithms that cannot easily be split into parts working in parallel.

Frequency growth is also a complicated task. In fact, we should optimize the design so that there is about the same amount of work at each pipeline stage. If this rule isn’t followed, the most loaded stage turns into a brake preventing further frequency growth. Moreover, higher working frequencies (other things being equal) automatically imply higher heat dissipation. Here the transition to finer production technologies can help, but this source also gets exhausted sooner or later (it is especially evident right now against the background of the production difficulties with the 90nm process all large chip manufacturers are going through). The thing is that while the geometry of the production technology keeps shrinking, new problems arise that turn out quite hard to solve.

During the development of its x86 processor architectures, Intel was working on increasing the number of commands performed per clock cycle. Every new CPU generation (Pentium, Pentium Pro) could process more commands per clock than the previous one. In other words, both factors were growing, which resulted in rapid performance increases. This continued until the frequency potential of the P6 micro-architecture was almost completely exhausted: it reached the 1400MHz frequency limit. The swan song of this micro-architecture was the Pentium III-S processor equipped with a 512KB L2 cache (I am not taking Pentium M processors into account here). Although its performance was on a pretty good level, it was already yielding to competing solutions in many respects. By the way, we have to give credit to this micro-architecture: it started at a 150MHz frequency (0.5micron production technology, Pentium Pro 150MHz) and then went all the way up to 1400MHz with the 0.13micron production process (the above mentioned Pentium III-S with a 1400MHz core clock). In other words, the CPUs with this micro-architecture got their frequency increased by 9 times while the geometrical elements became about 4 times smaller. In fact, we cannot name any other micro-architecture which could boast anything similar.

Of course, the developers started working on a successor to the P6 micro-architecture long before its potential was exhausted. This micro-architecture was called NetBurst, and its key distinguishing feature from the previous generation was a totally different set of priorities. The developers’ efforts were dedicated not so much to increasing the number of commands performed per clock cycle as to reaching the highest frequency possible on the same production process as that used for the P6 architecture. We would like to stress that nobody faced a strict choice between higher performance per clock cycle and higher frequency, but the development priorities stayed with the higher frequency, of course.

Certainly, it was a smart marketing move. The generations of users taught to believe that “bigger” means “better” proved the marketing point with their wallets. But today we will not be focusing on marketing that much: we are going to dwell on the features of the NetBurst micro-architecture which made it possible to reach all those super-high frequencies. And of course, on the consequences of these features, too.

This way, we will take a closer look at Pentium 4 keeping in mind that all the innovations and enhancements ever made to it were aimed at reaching higher working frequencies. In other words, every innovation was first of all evaluated by how well it helped further working frequency growth.

Chapter III: Introducing New Terms in the Micro-Architecture Field

For our further discussion it would make sense to recall the schematics of the Pentium 4 processor: the things we already know about the Pentium 4 micro-architecture, which are the same for all generations of these processor cores. Then, in our ongoing investigation, we will specify the differences distinguishing the Willamette and Northwood cores from Prescott. Here we would like to point out right away that the “default” core we are talking about will be Northwood: as we have already mentioned in the previous chapters, Northwood and Prescott differ significantly in a few important aspects. In particular, the time it takes the Prescott core to perform certain commands has changed a lot (see Appendix 1 for more numeric data on that). It means that the core has been modified dramatically. However, this is definitely no news to you. The news is how greatly the core has actually been modified, although the major traits of the NetBurst micro-architecture are still very distinguishable (longer pipeline, Trace cache, etc.).

So, in November 2000 Intel introduced a new processor to the public: Pentium 4. Together with the new CPU they introduced a new paradigm aka NetBurst, which was intended to help continue increasing CPU performance. We will return to a more detailed discussion of the NetBurst architecture in the coming chapters; for now please take a look at the flow-chart for the Willamette processor core. Here we are going to postulate the major working principles of the individual processor units. For the moment we will neglect the differences between the various processor cores (such as the production technology and the size of the L2 cache, which differ between the newer Northwood core and the older Willamette). Therefore, this core suits perfectly well to illustrate the general principles of the Pentium 4 micro-architecture.


The numeric data on this chart, such as bus bandwidth (3.2GB/s) and L2 cache size (256KB), are given for the very first Pentium 4 processor models, but this doesn’t matter for the general processor micro-architecture discussion.

You can see that the CPU is composed of a few functional blocks:

  1. Execution units and their auxiliary units (the Back End block);
  2. Units responsible for decoding instructions and delivering them to the first block in time (the Front End block). A few units responsible for certain specific features also belong here: the Prefetch unit and the Branch Prediction Unit. These units are not absolutely necessary for the proper functioning of the other units; they simply increase their efficiency. Let’s call them the Special block.
  3. Units managing the loading and transfer of data to the execution units (the Memory Subsystem).

Let me explain the meaning of the two new terms: Front End and Back End.

To put it simply, the CPU is a complex made up of a core and a memory subsystem, which has already been isolated into an individual block. The processor core can be represented as a combination of two unit blocks: Front End and Back End. The first block is responsible for providing the second block with data for processing in time. The Back End block is a group of the processor execution units in the broadest meaning of the word. In other words, it is exactly the part of the CPU that does all the “thinking”, plus the auxiliary logic. This way, each processor subsystem has its own goals: the Memory Subsystem delivers the data from the memory to the processor, the Front End rearranges this data in a certain way and then sends it to the Back End, which does all the processing. There is also a special group of units that are responsible for making the proper data ready for further processing and for “predicting” as precisely as possible which way the next branch will go. So on the one hand this is not just a purely auxiliary function, but on the other it is a very important mission: to create the best working conditions for all other units in the CPU so that they can work with maximum efficiency.

So, it would make perfect sense to discuss all the units keeping in mind this logical classification.

Let’s start with the Back End unit block. As I have already said, these are the functional units and the corresponding managing logic. If we take a look at the scheme above, we will see that there are five functional execution units, each of which performs its own operations. Three units of the five (two fast ALUs and one slow ALU) deal with fixed-point operations, and the remaining two – with floating-point operations. All these units are connected to the logic units that prepare data for them, send operands, read data from registers, i.e. do all the preliminary work for further calculations.

To illustrate this statement let me offer you one example. What do we need to do to increase the contents of some register “A” by 1? We need to make sure the following operations are performed:

  1. Take the contents of “A” register.
  2. Take 1.
  3. Send both numbers and the operation code (“addition”, “increase”) to the execution unit.
  4. Perform addition.
  5. Save the result.

You can see that the functional units actually perform only one operation: the fourth one in the list above. All the preparations, such as data transfer, are carried out by the managing logic servicing the execution block. The functional units in our example receive the numbers, receive the operation code (“add”, “subtract”, “change the sign”), perform the operation and produce the result, which is a number, too. They are certainly very important, but they are helpless without the managing logic.
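To make this division of labor more tangible, here is a minimal C sketch of the idea. It is our own illustration, not Intel’s design: the “functional unit” only receives two operands and an operation code and returns a result, while everything else (fetching the register contents, supplying the constant, saving the result) is left to the surrounding control logic.

    #include <stdint.h>

    /* Hypothetical operation codes the control logic can send to the unit. */
    typedef enum { OP_ADD, OP_SUB, OP_NEG } opcode_t;

    /* The "functional unit": it knows nothing about registers or memory.
       It simply takes two numbers and an opcode and produces a number. */
    static int32_t execute(opcode_t op, int32_t a, int32_t b)
    {
        switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_NEG: return -a;   /* second operand ignored */
        }
        return 0;
    }

    /* The managing logic does everything else: reads register "A",
       supplies the constant 1, calls the unit, stores the result back. */
    int32_t increment_register_A(int32_t reg_A)
    {
        int32_t operand1 = reg_A;  /* step 1: take the contents of "A" */
        int32_t operand2 = 1;      /* step 2: take 1                   */
        int32_t result = execute(OP_ADD, operand1, operand2); /* steps 3-4 */
        return result;             /* step 5: save the result          */
    }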

Now let’s bring up a more complicated example. Say, we have two tasks. The first task is to add the contents of registers “A” and “B”, and the second task is to increase the contents of register “C” by 1.

Of course, both these tasks can be fulfilled one after another, just the way everything was done in the first example. In this case we will need quite a lot of time. But if we had one more free execution unit, these tasks could be completed twice as fast. This is the case when a second execution unit makes our performance twice as high.

This is how we arrive at the idea of a superscalar processor, i.e. a processor capable of performing more than one operation per clock cycle. In fact, the ability to perform more than one operation per clock cycle is the essence of the superscalar architecture, and Pentium 4 is a superscalar CPU. However, the second task could just as well require increasing the contents of register “B” by 1. In this case we would have to wait for the first task to be completed, even though we had a second execution unit at our disposal. So, the managing logic should also decide whether the tasks are interdependent or whether they can be performed in parallel. This is another responsibility of the Back End block. In particular, the Pentium 4 core can perform up to three elementary operations per clock cycle, which is why its managing logic has to figure out the interdependencies between instructions very quickly.
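The difference between independent and dependent tasks is easy to show with a tiny C fragment (purely illustrative, with made-up variable names):

    /* Purely illustrative: which pairs of operations can run in parallel. */
    int independent_vs_dependent(int a, int b, int c)
    {
        /* Independent pair: a superscalar CPU can issue both additions in the
           same clock cycle, because neither result is needed by the other. */
        int sum = a + b;      /* task 1: "A" + "B" */
        int c1  = c + 1;      /* task 2: "C" + 1   */

        /* Dependent pair: the second addition needs the result of the first,
           so the managing logic must not start it until "sum" is ready. */
        int b1  = sum + 1;    /* has to wait for task 1 */

        return c1 + b1;
    }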

As we have already found out, it makes a lot of sense to perform many operations in parallel in order to increase the overall performance. However, it is not always possible. But maybe we could start not only the next closest micro-operation, but also a more remote one, which was scheduled to be run later? Maybe we could temporarily leave out a few micro-operations whose operands haven’t been calculated yet? Of course, only if the change in instruction order does not affect the result, and this condition needs to be controlled very accurately, too.

Well, we have just arrived at the idea of Out-of-Order execution. Imagine that we have a number of tasks. Some of them have to be fulfilled in a certain order, because they depend on the results of previous operations. Some of them don’t. And a third group is waiting for some data to be sent from the memory and hence will not start for a while.

There are two ways of solving this problem. We can either wait for the tasks to be completed in their respective order according to the initial program code, or we can try to perform all the operations which do not require any additional data. Of course, we will not be able to perform all the tasks “out of order”, but even doing it for a few of them will save us some time.

Let’s think about what we actually need in this case. We need a certain buffer where we can store the tasks. In the P6 architecture this buffer is called the reservation station. A certain unit (let me tell you now that it is called the scheduler, although we are going to discuss its details in the coming chapters) will be picking out the tasks that have all the necessary operands and can be completed at any given moment. The same unit is responsible for deciding which tasks can be completed out of order and which cannot. Since the intermediate data obtained as a result of successfully completed operations needs to be stored somewhere, we may require free registers. But there are only eight general purpose registers, while there can be more than eight tasks in the buffer. Not to mention that we cannot freely use the registers already occupied by previous micro-operations.
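A minimal sketch of the scheduling idea follows; it is our own simplification with invented names (the real scheduler is, of course, a hardware structure, not a loop). Micro-operations sit in a buffer, and on every cycle the scheduler picks those whose operands are already available, regardless of program order:

    #include <stdbool.h>
    #include <stddef.h>

    #define BUFFER_SIZE 8   /* illustrative size, not the real one */

    typedef struct {
        bool valid;          /* slot holds a micro-operation            */
        bool operands_ready; /* all source operands have been produced  */
        bool issued;         /* already sent to an execution unit       */
    } uop_slot_t;

    /* On every "cycle" pick up to max_issue ready micro-operations,
       ignoring their original program order. Returns how many were issued. */
    static int schedule(uop_slot_t buffer[BUFFER_SIZE], int max_issue)
    {
        int issued = 0;
        for (size_t i = 0; i < BUFFER_SIZE && issued < max_issue; i++) {
            if (buffer[i].valid && buffer[i].operands_ready && !buffer[i].issued) {
                buffer[i].issued = true;   /* send to an execution unit */
                issued++;
            }
        }
        return issued;
    }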

So, since the intermediate results have to be stored somewhere, we need a set of auxiliary registers. And this set should be pretty large, so that we can use these registers without very strict limitations (in the Pentium 4 processor there are 128 registers like that). And since no x86 program knows about more than 8 general purpose registers, we need to teach these auxiliary registers to pretend that they are the general purpose ones. So, we need a special unit, which is actually called the register renaming unit. In fact, it handles only one thing: when any of the tasks needs a general purpose register (for instance, register “A”), this unit takes the first free register, names it “A”, and allows the task to store its result in it when the operation is completed. And when the time comes (according to the order specified by the initial program code) this unit has to take the result from this register (thus making it free again) and send it where necessary.
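The renaming idea itself fits into a few lines of C. This is a deliberately naive model with invented names, only meant to show the principle of mapping the eight architectural registers onto a large pool of physical ones:

    #include <assert.h>
    #include <stdbool.h>

    #define ARCH_REGS 8      /* the eight x86 general purpose registers   */
    #define PHYS_REGS 128    /* the pool of internal registers (as in P4) */

    static int  rename_map[ARCH_REGS];   /* which physical register currently
                                            "pretends" to be EAX, EBX, ... */
    static bool phys_busy[PHYS_REGS];

    /* A new micro-operation wants to write architectural register arch_reg:
       grab a free physical register and remember that it now plays this role. */
    static int rename_dest(int arch_reg)
    {
        for (int p = 0; p < PHYS_REGS; p++) {
            if (!phys_busy[p]) {
                phys_busy[p] = true;
                rename_map[arch_reg] = p;
                return p;        /* the uop will write its result here */
            }
        }
        assert(!"no free physical registers: the front end has to stall");
        return -1;
    }

    /* A later micro-operation that reads arch_reg simply looks up the map. */
    static int rename_src(int arch_reg)
    {
        return rename_map[arch_reg];
    }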

The same thing happens in the Pentium 4 processor: if certain tasks can be completed out of order, the managing logic performs them as if in draft and saves the results in the special registers. Later, when all the operations have been completed, the logic saves all the results in the exact order assigned by the program. Out-of-order command execution is another important responsibility of the Back End block.

Now let’s move on to the Front End, a group of units responsible for decoding and storing the instructions.

First of all, let me describe the problem the CPU developers faced. We have already understood that we need to make each pipeline stage as simple as possible in order to increase the working frequency. However, the irregular structure and varying complexity of x86 instructions are an obstacle here: x86 instructions are of different lengths, require different numbers of operands and even have different syntax. Even if two instructions are of the same length, they can still require completely different effort and resources to be executed. There are two ways out here: we can either make the execution units “smart”, so that they can work with any instruction on their own, or we can try to make the instructions arriving at the Back End much simpler.

The first way out automatically results in a lower working frequency, because each functional unit would have to handle the instructions on its own. Note that really complex instructions occur very rarely, so this complicated design of the execution units would be nothing but a waste of resources.

The second way out implies that we need a unit which will turn x86 instructions of different length (up to 15 bytes!) and complexity into simple and easy to understand commands for the execution units. The modified instructions should have a regular structure and should be suitable for flawless execution in very simple functional units. By a regular structure we mean the following: the same instruction length, a standard representation form (location of operands and service labels), almost the same execution complexity. If an x86 command cannot be represented as a single simple instruction, it can be transformed into a chain of simple instructions.

The industry chose the second way a long time ago. This unit has long been present in CPUs and it is called the decoder. It translates x86 commands into instructions of a very simple format, and these instructions are known as micro-operations (uop-s).

The ideology of micro-operations is not that new: we already talked about it in the times of Pentium Pro, Pentium II and Pentium III. The functional units of the Pentium 4 processor also work with micro-operations instead of x86 commands. These uop-s are generated by the decoder unit from the x86 instructions.
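To give you a feeling of what the decoder produces, here is a hypothetical example. The real internal uop encoding is not documented, so the structure, the field names and the exact breakdown below are our own invention, only meant to show the principle: an x86 instruction like ADD [address], EAX, which both reads memory and writes it back, can be split into a short chain of simple, equally shaped micro-operations.

    /* Purely illustrative uop format: fixed fields, one operation per uop. */
    typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind_t;

    typedef struct {
        uop_kind_t kind;
        int        dst;    /* destination (internal) register */
        int        src1;   /* first source register           */
        int        src2;   /* second source register          */
    } uop_t;

    /* One possible decoding of the x86 instruction "ADD [address], EAX":
       the variable-length, memory-touching x86 command becomes a chain
       of regular micro-operations of identical structure. */
    static const uop_t add_mem_eax[] = {
        { UOP_LOAD,  1, 2, 0 },   /* tmp = [address]      */
        { UOP_ADD,   1, 1, 3 },   /* tmp = tmp + EAX      */
        { UOP_STORE, 0, 2, 1 },   /* [address] = tmp      */
    };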

Let’s veer away for a moment: I believe you are quite used to the so-called Harvard cache scheme, with a separate cache for data and a separate cache for instructions. This is exactly the way the cache of the Pentium III processor was organized. But unlike Pentium III with its almost traditional command decoding scheme, Pentium 4 features some significant modifications.

Instead of the traditional instruction cache where x86 code was stored, Intel introduced a modified instruction cache called the Trace cache. It is located after the decoder but before the other processor units. It no longer stores x86 instructions, but the result of their decoding, i.e. uop-s. Here the decoder works independently of all other processor units, filling the Trace cache with uop-s at a maximum rate of one x86 instruction per clock cycle. If the Trace cache does not contain the instructions for the code requested by the CPU, they will be relatively slowly loaded from the L2 cache and decoded on the fly. Of course, the decoding and data transfer at this stage is strictly determined by the executed program code. The micro-operations are taken from the decoder as they become ready, and the total throughput at this point will never exceed one x86 instruction per clock cycle.

When the same part of the code is addressed a second time, the corresponding uop-s will be found in the Trace cache and the CPU doesn’t have to perform the decoding again. The decoded instructions are transferred at up to 6 uop-s per two clock cycles (the Trace cache works at half the core frequency).

Actually, this allows saving quite a significant amount of resources: despite Intel’s extreme secrecy regarding the clock cycles involved in decoding, there is some indirect evidence suggesting that the “length” of the decoder equals at least 10-15 and at most 30 clock cycles. In other words, it is comparable with the rest of the Pentium 4 pipeline. The probability that the requested part of the code is available in the Trace cache is about 75-95% for an average program.

In most cases each x86 instruction is turned into 1-4 micro-operations. This is the way simple instructions are decoded. As a rule each micro-operation occupies one cell in the Trace cache, although there are some special cases when there will be two uop-s in a single Trace cache cell. Complex x86 instructions are transformed into special uop-s (MROM-vectors), or into a combination of MROM-vectors and regular micro-operations. These MROM-vectors are a kind of tag for a certain chain of uop-s, and moreover, they occupy minimal space in the Trace cache. When the corresponding part of the code needs to be executed, the MROM-vector is sent to the Microcode ROM, which responds with the chain of normal micro-operations marked with this vector. This storing algorithm allows us to save a lot of space in the Trace cache.

One of the most interesting and frequently asked questions is about the size of the Trace cache. Intel states that it is designed to accommodate 12,000 micro-operations. However, if we would like to compare it with a regular instruction cache, it would be a good idea to calculate the effective size of the Trace cache (in KB of x86 code). Of course, its structure implies that its effective size will be different depending on the code type. An approximate calculation shows that its size varies from 8KB to 16KB of “standard cache” space depending on the applied calculation algorithm.
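For those who like to check the arithmetic, here is the kind of back-of-the-envelope estimate we mean. The average instruction length and the average number of uop-s per x86 instruction below are our own assumptions and vary a lot from one program to another, which is exactly why the result is a range rather than a single number:

    #include <stdio.h>

    int main(void)
    {
        double trace_cache_uops = 12000.0; /* Intel's stated capacity            */
        double uops_per_x86     = 1.5;     /* assumed average for integer code   */
        double bytes_per_x86    = 2.0;     /* assumed average instruction length */

        double effective_bytes = trace_cache_uops / uops_per_x86 * bytes_per_x86;
        printf("effective size: ~%.1f KB of x86 code\n", effective_bytes / 1024.0);
        /* with these assumptions: ~15.6KB, near the top of the 8-16KB range */
        return 0;
    }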

In the meantime, let’s return to the Front End block. In fact there is one more problem to be solved besides the translation of x86 commands into a simple internal processor format.

This is the branching problem. Let me explain. Say, we’ve got a part of the program code which is currently being executed by the processor. Everything goes smoothly: the decoder turns x86 commands into uop-s, the execution units work on them. Beautiful! But if there is a jump instruction, we should learn about it not when the previous instruction has been executed, but long before that. Otherwise, when we suddenly run into the jump instruction, we will have to wait for as long as it takes for the entire pipeline to switch to the new branch.

In order to save our time there is a special unit called the Branch Prediction Unit. Its responsibility is to try to foresee the branch direction, thus saving us some time in case of a successful prediction. And if the prediction turns out false, the CPU pays a penalty (the pipeline is flushed and all buffers are cleared).

The second case when we might need a unit like that is when there is a conditional branch in the program. These are branches that depend on the result of some operation. That is why we need to “guess” whether this branch will be taken or not.

If we guess at random, this is hardly going to do any good. To make things simpler for the Branch Prediction Unit, the results of approximately the last 4,000 branches are stored in a special Branch History Table.

Moreover, the accuracy of recent branch predictions is also monitored, so that the prediction algorithm can be corrected if necessary. As a result, the decoder performs a conditional branch “de facto” following the hint from the BPU, and then the BPU checks whether the condition was predicted correctly.
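The classic textbook way to implement such self-correcting prediction is a table of 2-bit saturating counters. We do not know the exact algorithm Intel uses, so treat the sketch below as a generic illustration of the idea rather than a description of the Pentium 4 BPU; the table size is simply chosen to roughly match the “about 4,000” figure above.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096   /* roughly matches the "about 4,000" figure */

    /* One 2-bit saturating counter per entry:
       0,1 = predict "not taken", 2,3 = predict "taken". */
    static uint8_t bht[BHT_ENTRIES];

    static bool predict(uint32_t branch_address)
    {
        return bht[branch_address % BHT_ENTRIES] >= 2;
    }

    /* Once the branch is actually resolved, nudge the counter towards
       the real outcome; a single misprediction does not flip the prediction. */
    static void update(uint32_t branch_address, bool taken)
    {
        uint8_t *counter = &bht[branch_address % BHT_ENTRIES];
        if (taken  && *counter < 3) (*counter)++;
        if (!taken && *counter > 0) (*counter)--;
    }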

This way, we have come to the Branch Prediction Unit and the Prefetch mechanism. The latter should “guess” what data the CPU might need later on. Of course, this is not a wild guess at all: special algorithms analyze the address sequences used for data loading and try to derive the next address based on this data.

Here is why this unit turned out so necessary. During program execution there comes a moment when you have to address the memory and request some data from there. It would be OK if it didn’t take the memory so long to deliver the requested data: hundreds of CPU cycles. And during this time the CPU has nothing to do. That is why it makes sense to request the data in advance, so that we can save some time and thus increase the CPU efficiency. The question here is where we should go for this data and what data we will need next. In fact, we will return to some details of the Data Prefetch mechanism later in this article; in the meantime this is all you need to know about it. The Data Prefetch mechanism works side by side with the Branch Prediction Unit and with the third block called the Memory Subsystem.
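A very rough model of what such an “address guesser” does is shown below. The real hardware prefetcher is far more elaborate; this only demonstrates the basic stride-detection idea, and all the names are ours.

    #include <stdint.h>

    /* Remember the last requested address and the last observed stride;
       if two consecutive requests move by the same step, assume the
       program walks through memory with that step and fetch ahead. */
    typedef struct {
        uint64_t last_address;
        int64_t  last_stride;
    } prefetcher_t;

    /* Returns an address worth prefetching, or 0 if there is no clear pattern. */
    static uint64_t on_memory_request(prefetcher_t *p, uint64_t address)
    {
        int64_t stride = (int64_t)(address - p->last_address);
        uint64_t prediction = 0;

        if (stride != 0 && stride == p->last_stride)
            prediction = address + (uint64_t)stride;  /* request it in advance */

        p->last_stride  = stride;
        p->last_address = address;
        return prediction;
    }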

Well, this is the last functional block we have to discuss now. And the task of the memory subsystem is pretty simple, I should say. It has to deliver the requested data as soon as possible and save the obtained results. And since the working frequencies of contemporary memory cannot even be compared with the working frequencies of contemporary CPUs, there are caches of different levels that get involved at different stages. These are small amounts of high-speed memory used to store the most frequently requested data. The Pentium 4 processor has two cache levels. Some other processors may have three cache levels.

The L2 cache plays the major role in the Pentium 4 processor. In fact, it is exactly the L2 cache that serves as the main data storage for the Pentium 4 micro-architecture. Why so, you will find out later on.

The most frequently demanded data is still located in the L1 cache, as it used to be. But unlike in the previous processor generation, such as Pentium III, the size of the L1 data cache is relatively small: only 8KB (in the Willamette and Northwood cores) and 16KB (in the Prescott core). However, it is very fast: the cache access latency is only 2 clock cycles compared with 3 clock cycles for Pentium III. The difference doesn’t strike you as immense, but if we take into account the working frequencies (with the same production technology the working frequency of the Pentium III CPU was 1GHz, while for Willamette it was 2GHz), it turns out that the L1 data cache now needs 1ns instead of 3ns. This is a much more impressive difference, isn’t it?

Since the L1 data cache is not that big, it is quite possible that the requested data may not be there, because 8KB (16KB) is even less than the “locality space” of most programs (except specifically written ones). If the necessary data is absent, a request to the L2 cache is initiated. From its predecessor Pentium 4 inherited a non-blocking L2 cache, which can process up to 8 requests at a time. If the data we are looking for is there, it will be copied along a 256bit bus into the L1 data cache. Moreover, the Pentium 4 processor can copy 256 bits of data every clock cycle! This technology is called Advanced Transfer Cache and is a continuation of the corresponding technology implemented in the Pentium III processor, which could copy 256 bits of information every second clock cycle. This bus bandwidth is really in demand, because if the CPU is addressing different lines of the cache, it cannot start working with the second cache line without finishing the transfer of the first one. That is why even though a data transfer rate like that might seem pretty high and even excessive from the theoretical point of view, it turns out to be needed in real life.
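The bandwidth figures quoted later in this article are easy to reproduce yourself: 256 bits are 32 bytes, and 32 bytes every core clock at, say, 3GHz give the numbers below (the small discrepancy with the footnotes of the table in Chapter V comes from decimal versus binary gigabytes). A quick sanity check:

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 3.0e9;   /* 3GHz core clock (example)        */
        double bytes_per_clock = 32.0;    /* 256-bit bus = 32 bytes per clock */

        double bw = clock_hz * bytes_per_clock;
        printf("%.1f GB/s (decimal) = %.1f GB/s (binary)\n",
               bw / 1e9, bw / (1024.0 * 1024.0 * 1024.0));
        /* prints: 96.0 GB/s (decimal) = 89.4 GB/s (binary) */
        return 0;
    }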

If the requested data is not available in the L2 cache either, the Bus Unit involves RAM by sending a corresponding request there. Of course, it takes RAM much longer to respond than it takes the L2 cache (the Northwood core features a 7 clock cycle L2 cache latency, while the situation with the Prescott core is completely different: it features an 18 clock cycle latency). A request to RAM is a real emergency for the CPU, because the response from the memory takes hundreds of processor clock cycles. This is why we need the Data Prefetch mechanism so badly: it is supposed to predict what data we might need so that the data can be delivered in time.

But let’s get back to the cache. The L2 cache is 256KB for the Willamette core, 512KB for the Northwood core and 1MB for the Prescott core. Moreover, Pentium 4 XE has a 2MB L3 cache (in Xeon processors the L3 cache may be as large as 4MB). This cache is connected to the core with a 64bit bus, features higher latency than the L2 cache, and is also inclusive, i.e. it duplicates the contents of the L2 cache. Nevertheless, you can see the effect of the L3 cache in many programs: even though it is considerably slower than the caches above it in the hierarchy, it is still much faster than RAM.

Well, these were the major blocks of the processor architecture split into functional units. Of course, this is not a full description, just an illustration of how the CPU works. We are going to take a closer look at the functioning of selected processor subsystems in the next chapters. So keep on reading! :)

Chapter IV: Arithmetic Logic Units in NetBurst Micro-Architecture

This chapter is devoted to the ALU units of the NetBurst micro-architecture, including the double-frequency ALUs.

As you remember, the major task of the Pentium 4 micro-architecture is to increase performance by raising the working frequency. To increase the frequency, we need to perform the main integer operations as fast as possible. However, the integer units differ a lot in complexity, because so do the operations they execute: some of them are very complex and some of them are simple.

However, Intel engineers found a very unusual but efficient way of speeding up integer operation execution. All uop-s that come to the ALUs can be processed by two types of units: one slow [integer] ALU and two fast [integer] ALUs. The first unit can process quite a wide range of integer operations; in particular, it deals with the most complex ones. At least this is what we thought about it at first, judging by its name. The reality, however, turns out to be much more sophisticated. But let’s not get ahead of ourselves.

The two fast ALU units are much more interesting and much more narrowly specialized. They are intended for simple integer operations, such as calculating the sum of two integers. But they perform this addition much faster than the slow ALU would have done it, because they work at double the processor clock frequency. It means that a CPU working at an actual core clock frequency of 3GHz has some units working at 6GHz (note that the fast ALUs are not the only units working at double the processor frequency, they are just a part of a much bigger system). We would like to point out that these two units are not identical: fast ALU 0 is more universal than fast ALU 1 and can execute many more commands.

But the most important thing is the following: in order to make it easier to speed up this part of the CPU, a fast ALU consists of two 16bit wide “sub-units”, which are shifted by one stage from one another. The addition of numbers is split into a few stages in order to reduce the amount of work done at each stage, so that the ALU working frequency can get higher. Note that right now we are talking about the Northwood core; the Prescott core is completely different in this respect.

As a result, the fast ALU processes numbers “by halves”. But it does so not every clock cycle but twice as fast, every half a cycle (these halves are called ticks). To be more exact, it works at its own frequency, which is twice as high as the frequency of the rest of the CPU (let me call the latter the nominal frequency). Let me remind you that when we speak of the frequency the “rest of the CPU” works at, we exclude the Trace cache (which works at half the nominal frequency) and a few other units working at twice the CPU frequency, just like the fast ALUs. Each process in the fast ALU is synchronized with a half of the nominal processor clock. As a result, each clock of the fast ALU is only half as long as the “nominal” clock. All units working at twice the processor frequency are called the Rapid Execution Engine, although now there is a tendency towards using this term in a broader meaning. So, let’s sum it all up now. In the Pentium 4 processor there are units that work at different frequencies at the same time: the Trace cache works at half the nominal core frequency, the Rapid Execution Engine (including the fast ALUs and the corresponding managing units) works at twice the nominal core frequency, and all the other processor units work at the nominal core frequency.

The addition of two numbers is performed as follows. During the first tick (half a clock) the fast ALU sums up the 16 low-order bits of the two added 32bit numbers. The whole algorithm is very similar to the way you would perform the addition of two numbers on paper, writing one above the other. Two short numbers can be summed up mentally. Two long numbers have to be written down first so that you can add the digits one by one. The ALU can add two short numbers just the same way, but when it comes to two long numbers, it takes much longer. That is why 32bit numbers are added in 16-bit halves, one after another, so that the addition can be sped up (and so can the number processing). Very often the addition produces a carry into the next position. When we do mental calculations we simply memorize it. In the CPU this is called the “carry bit”. By the end of the first tick the carry bit for the low-order halves of the two numbers is ready.

During the second tick the second 16-bit sub-unit of the fast ALU, which is shifted by half a clock relative to the first one, tackles the addition of the 16 high-order bits of the first pair of numbers, also taking the corresponding carry bit into account.

By the third tick we also know the service flags that accompany all numeric operations (whether there has been an overflow, a zero result, a negative result, etc.). These flags can be used for further operations, such as conditional branching, for instance.

Note that the first sub-unit of the fast ALU, the one processing the 16 low-order bits, has already finished its work by the second tick. Of course, it would make perfect sense to keep it busy anyway. This is exactly what happens: during the second tick the first sub-unit of the fast ALU starts processing the next pair of numbers. As a result we start the addition of the second pair of numbers with only half a clock gap from the first one!
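In C the trick looks roughly like this. It is a functional model only: real hardware performs the two halves in two different sub-units, half a clock apart, while here we simply show how a 32-bit sum is assembled from two 16-bit additions and a carry bit.

    #include <stdint.h>

    /* Tick 1: add the 16 low-order bits of both numbers.
       Tick 2: add the 16 high-order bits plus the carry from tick 1.
       Tick 3: the flags (zero, sign, overflow, ...) become available. */
    static uint32_t staggered_add(uint32_t a, uint32_t b)
    {
        uint32_t low   = (a & 0xFFFF) + (b & 0xFFFF);     /* tick 1          */
        uint32_t carry = low >> 16;                        /* the carry bit   */
        uint32_t high  = (a >> 16) + (b >> 16) + carry;    /* tick 2          */
        return (high << 16) | (low & 0xFFFF);              /* full 32-bit sum */
    }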

What is the result then? First of all, we managed to speed up the start of new addition operations significantly compared with a traditional ALU structure. Really, instead of one pair of numbers per clock, we can now start an addition every half a clock, which results in the complete processing of two pairs of numbers per clock. It was for the sake of this peak throughput increase that all these changes were made to the ALU.

At first glance we have to pay for the higher processing rate with higher latency: one and a half clocks instead of one clock in a traditional ALU. You shouldn’t be too upset about it though, as we’ve got one beautiful opportunity. Let’s see when the different parts of the end result become ready:

The opportunity we are talking about is the opportunity to immediately forward only the necessary part of the result as a partial operand for the next addition operation, without spending any time on the actual data transfer. Say, for example, that we have a chain of dependent addition instructions, where the result of each next instruction depends on the result of the previous one. Let’s start with the first instruction. During the first tick we finish processing its low-order half, and this half is immediately forwarded to the next instruction. During the second tick the high-order part of the result is calculated, and at the same time the low-order part of the next result is being calculated, too. And so on and so forth. In other words, a chain of 100 dependent ADD operations, each of which uses the previous result as its operand, will be completed in about 50 clock cycles. As a result, the effective latency of each operation is half a clock! An excellent result! This is exactly how Pentium 4 processors based on the Willamette and Northwood cores work.

Of course, there are no miracles in this world, which is why it still takes the same time to obtain the flags. However, there is a much more important limitation: the fast ALU doesn’t handle all operations, only the simplest ones, such as ADD. In order to make the above described mechanism work flawlessly, the fast ALU should execute only those micro-operations that process their operands starting with the low-order bits and moving towards the high-order bits, and never the other way around. All the operations that do not comply with this condition are processed by the slow ALU. And if the operands have to be transferred to the slow unit, the resulting delay equals 4 ticks (two clock cycles).

The best example of an operation that ruins this convenient processing order is the shift operation. In the Northwood processor this command is apparently executed in the FPU block; to be more exact, in a unit called the MMX-shifter.

Let’s read that once again: all shifts are performed in the MMX-shifter with a 4-tick delay. But they are sent to the slow ALU, aren’t they? What is the connection between the slow ALU and this MMX-shifter then?

We get the impression that the slow ALU is a kind of virtual unit! It looks like this unit receives the tasks, transforms them into an acceptable format and then sends them to the actual execution units. In other words, it serves mostly as a “helper and manager” rather than a direct executor of commands.

Since different operations are executed in different units, the results turn out pretty interesting. Let’s consider, for instance, a “shift 4 positions to the left” operation, i.e. multiplication by 16. This operation can be performed with a MULTIPLY command, but it will take a lot of time this way, because Northwood doesn’t have an integer multiplication unit (!) (which used to make many CPU-design gurus really furious). Also, this operation can be performed directly as a shift, and we already know that in this case it will take 4 clocks (8 ticks). But since not all operations require the same amount of time to complete, it makes sense to replace one type of operation with another. In particular, you can see from the example above that it is much more economical to replace the SHIFT operation requiring 4 clocks with four ADD operations, because the total time they need to complete is only 2 clocks (4 ticks). Moreover, it is better to replace SHIFT operations with ADD operations for shifts of up to 7 positions. In fact, Intel’s compiler used to do exactly that until the Prescott core arrived.
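Here is what such a replacement looks like at the source level (an illustration of the principle, not actual compiler output): a shift left by 4 is the same as doubling the value four times, and on Northwood the four dependent ADDs finish sooner than the single SHIFT.

    #include <stdint.h>

    /* "x << 4", i.e. multiplication by 16, done the straightforward way:
       on Northwood this shift goes through the MMX-shifter, 4 clocks. */
    static uint32_t shift_by_4(uint32_t x)
    {
        return x << 4;
    }

    /* The same result as four dependent additions (four doublings):
       on Northwood each ADD has an effective latency of half a clock,
       so the chain finishes in about 2 clocks; on Prescott each ADD
       costs a full clock, so this "optimized" chain takes ~4 clocks,
       while the plain shift now takes only 1 clock in the fast ALU. */
    static uint32_t shift_by_4_as_adds(uint32_t x)
    {
        x = x + x;   /* x * 2  */
        x = x + x;   /* x * 4  */
        x = x + x;   /* x * 8  */
        x = x + x;   /* x * 16 */
        return x;
    }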

The ALU structure described above makes it very important how the program is actually written and compiled. Here the Pentium 4 micro-architecture veers slightly away from the direction traditional microprocessor development takes, where the engineers do their best to make sure there are no significant differences in the processing times of various instructions.

The example we have just discussed is a perfect illustration of why the Pentium 4 processor turned out so sensitive to software optimizations. Of course, Intel didn’t forget to offer the corresponding compiler, because without software optimizations Pentium 4 simply cannot perform fast.

Now it is high time we said a few words about Prescott. We have already mentioned that it is very different from the Northwood micro-architecture. For example, the ALU units of the Prescott core are built completely differently from all the other processor units, because they use transistor logic based on differential pairs in order to increase the working frequencies. This “differential logic” can work at much higher frequencies, which is why it was applied here. However, it generates much more heat per transistor, requires many more transistors and generates heat even when idling.

The operation processing algorithms also differ between the two cores.

Firstly, the effective latency of the instructions sent to the fast ALU (including ADD instructions) has changed. It used to be half a clock, and now it is twice as high, i.e. 1 clock cycle. It means that the result of the low-order bits calculation can now be forwarded for further processing only after one clock. So, the above mentioned chain of 100 ADD operations will require 100 clock cycles on Prescott, instead of 50 clock cycles on the Northwood processor.

Secondly, SHIFT operations are now processed in the fast ALU. Prescott acquired a special unit that executes those SHIFT operations that couldn’t be executed in the previous fast ALU (i.e. the shifts where the direction of the shift doesn’t coincide with the direction from the low-order bits to the high-order bits). It takes one clock to perform a shift like that, and up to two shifts can be performed per clock. As you remember, a single shift used to take 4 clocks, with no more than one shift per clock, because they were done in the MMX-shifter unit.

Thirdly, the Prescott CPU acquired an integer multiplication unit. We have already mentioned that the absence of this unit was very painful for some CPU design gurus.

Also, as I have already said, the latencies of many instructions changed, and you can read more about the actual changes in Appendix 1.

This actually leads to pretty funny consequences. If it used to be much better to replace SHIFTs with ADD operations in the code, now the situation is just the opposite. In the example above this replacement allowed completing the operation in 2 clock cycles instead of 4. But the code optimized this way for the Northwood processor requires 4 clock cycles on Prescott, i.e. it takes much longer than the NON-optimized code, which requires only 1 clock. It is also remarkable that in terms of instruction processing times Prescott turned out much more similar to all other x86 processors than Northwood: there are no longer huge deviations in processing times between different instructions, unlike the situation with Northwood.

Therefore, all the software compiled for CPUs with the Northwood core should be recompiled anew for Prescott-based processors.

What were the really interesting things we learned about in this chapter? We learned that the CPU has some units working at twice the core frequency (not only the fast ALUs, but also a few other units); that the slow ALU is a kind of “virtual” unit; and that there is a really big difference between the Northwood and Prescott micro-architectures. You understand now that the ALUs of Pentium 4 processors are very different from the ALU of any other CPU.

Chapter V: Cache Hierarchy

Of course, you may have noticed that we jumped right to the execution units, having omitted quite a number of other CPU architecture components. This is definitely not because there is nothing to say about any of the previous stages. In fact, right on the contrary: we found some really interesting details there. The thing is that we need to revise a few things we already know before we move on to the detailed discussion of the scheduling units and micro-instruction queues. And among these things are cache access latencies, for instance.

Access to the caches at different hierarchy levels is characterized by the time it takes to access the cache and the width of the bus between the levels. The table below sums up these data for the different caches of the CPU:

 

Bus bandwidth at 3GHz frequency

L1 D cache* – registers: 48GB/sec***
L2 cache – L1 D cache: 96GB/sec
L3 cache** – L2 cache: 24GB/sec

* - the numbers are given only for the data cache, because there is no defined bus bandwidth for the Trace cache. We only know the maximum transfer rate of 6 uop-s per two clock cycles, but since there is no fixed size for a single uop, it is hard to estimate the actual bandwidth.
** - only the Gallatin core has an L3 cache right now. This is a modification of the 130nm Northwood core with an integrated on-die L3 cache (with ECC support). This core is used for the following CPUs: Pentium 4 Extreme Edition, Xeon MP, Xeon DP with L3 cache.
*** - in fact the more exact bus bandwidth rates are 44.8GB/s, 89.5GB/s and 22.4GB/s.

You can easily notice that different cache levels have different peculiarities. In particular, the read speed from the L1 cache into the registers is much lower than the data transfer rate along the bus between the L1 and L2 caches. You wonder how this happened? Why do we need the L1 cache at all then? Does it make any sense this way?

It is actually because the task of the L1 cache is different from what the L2 cache is supposed to do: the L1 cache should find and present the needed data fast and with minimal latency. Note that the read speed from this cache doesn’t exceed 16 bytes per clock (48GB/s). Moreover, one of the main reasons for the L1 cache to be there is the inability of the CPU to access the L2 cache directly. In order to use the data, a request has to be created, the data found, transferred into the L1 data cache, etc.

But if the requested data is not in the L1 cache, then it needs to be transferred there as soon as possible.

The following example illustrates why this is necessary. Imagine that the program needs some data which is not arranged in an ordered chain (with sequential addresses). All the pieces of data are more than a cache line apart from one another. Or, even worse, they are scattered all over the place within quite a big area of memory. In this case we have to transfer an entire 64-byte line in order to read only one byte of information(!). In other words, the amount of information transferred from the L2 cache is 64 times bigger than what we actually need. Of course, the higher the data transfer rate from the L2 cache to the L1 data cache, the better. Moreover, the decoder also grabs its data (the x86 code) from the L2 cache.

That is why the bus between the caches features a simply fantastic bandwidth of 96GB/s (at a 3GHz clock frequency). All this should reduce the data transfer time and the delays. Roughly speaking, the L1 data cache should provide minimal latency, and the L2 cache – maximum data transfer rate.

In the Prescott core the caches are not so strictly specialized. The L1 cache latency, even in nanoseconds, is higher than that of the L1 cache in the Northwood core. And the data transfer rate between the L1 and L2 caches in Prescott is very often lower than that in Northwood. You can see it clearly if you try addressing the beginning of every line which is absent in L1 but present in the L2 cache.

The Northwood core demonstrates a stable data transfer rate of 32 bytes per clock, while the Prescott core shows about a 1.5-2 times lower result.

I have to point out here that the rate at which data arrives from the L1 cache greatly depends on where this data comes from. If it comes directly from the L1 data cache, its maximum read speed will be 48GB/s. If the data is transferred from the L2 cache, it arrives into L1 at 96GB/s.

This way you see that the physical meaning of these two bandwidth figures is completely different: since the data cannot get “lost”, we have to admit that the L1 data cache works at different speeds in different situations. That is why it doesn’t make much sense to provide only one number for the data transfer rate; we should always specify which case we are talking about. Moreover, the data transfer rate also depends on the type of the requested data. For example, if we need to transfer data from the L1 data cache into the SSE2 registers, the maximum speed is 16 bytes per clock. For a transfer from the L1 data cache into the MMX registers the maximum speed is 8 bytes per clock, and for a transfer into the integer registers – 4 bytes per clock. Not that simple, eh?

Note that data can be read and stored at the same time. But in this case the reading cannot go at the maximum speed of 16 bytes per clock, which is why our estimated maximum data transfer rate of 48GB/s will be quite correct in the long run.

 

Cache latency, clocks

                Northwood              Prescott
L1D cache       2 (ALU) / 9 (FPU)      4 (ALU) / 12 (FPU)
L2 cache        7                      18
L3 cache        14*                    -

* - if the open page is accessed. The TLB is relatively small, which is why L3 cache access always occurs to a closed page. In this case the latency is about 50 clocks.

An L1 data cache line is 64 bytes long. Half a cache line (32 bytes of data) can be loaded from the L2 cache every clock cycle.

The L2 cache is arranged in 128-byte lines, which are split into two sectors of 64 bytes each. Each of the sectors can be read independently. You will understand what’s behind this organization if you recall that data is transferred from the memory in 128-byte blocks, while into the memory it goes in 64-byte blocks.

The L3 cache is not of that much interest for us here, because it performs an auxiliary function in the NetBurst micro-architecture we are discussing today. It can be up to 4MB big, is located on the processor die and is connected to the core with a 64bit bus. The L3 cache access latency is certainly higher than that of the L2 cache, but it is still much lower than the memory access latency. Besides, the L3 cache is bigger than the L2 cache, which significantly improves the probability that the requested data is available there.

The L2 and L3 caches are non-blocking and can process up to 8 requests at a time. Northwood features a 4-way set-associative L1 cache, and Prescott – an 8-way set-associative L1 cache.

There is one important thing to be mentioned: the cache access latencies given in the whitepaper documentation differ from those we get during real testing, because they do not take the cache functioning strategy into account. The claimed values are nothing but the L2 cache latencies for the case when the cache is completely isolated from all other sub-systems.

But any actual request to the L2 cache of the Northwood processor will show 9+ clocks. When new data is requested, the L1 cache is always checked first. If the requested data is not there, the search continues at the other hierarchy levels. In this case it means the following: the total latency will equal 2 clock cycles in the L1 cache plus 7 clock cycles in the L2 cache.

Similarly, the actual L2 access latency in the Prescott core will equal 22 clocks or even more. Note that it won’t be easy to get these numbers: the requests need to be composed in a specific way, so that the data prefetch has no influence on the latency.

By the way, since we came to speak about prefetch, let me remind you that this mechanism was developed even further in Prescott (as you remember, its major task is to “deliver” the data to the processor core in time). To be more exact, the limitations that existed in the Northwood core have finally been eliminated. In particular, the prefetch of the Northwood processor couldn’t cross the 4KB boundary of a virtual memory page, which reduced its efficiency tremendously. The problem was that the Northwood prefetch mechanism could grab data from a page only when that page had already been written into the TLB. If the data of the “following” page was requested and that page hadn’t yet been written into the TLB, the Northwood core would not initiate a request to the page tables for further translation and would not add any record into the TLB. That is why this prefetch mechanism didn’t go beyond the 4KB limit. Strictly speaking, it can go beyond 4KB, but only if the page is already in the TLB; it cannot add a record about the page into the TLB if the page is not there yet.

In Prescott everything is different. The prefetch mechanism is now more aggressive and no longer suffers from limitations such as the 4KB virtual memory page boundary. Moreover, it handles loop exits much better.

As a result of all these improvements (as well as the faster bus), work with the memory subsystem turned into Prescott's major advantage. Work with the caches, on the other hand, became somewhat less efficient, and the maximum bandwidth got lower, especially for write operations. In fact, we have already mentioned this above.

I have to point out that once the size of the L1 and L2 caches changed, the speed of handling cache conflicts changed as well. Aliasing conflicts are the most important ones among them. In this context the term describes a situation when two different memory addresses have, for instance, identical low-order 16 bits. If we try to place both memory lines in the cache and access them in turns, they will keep evicting one another from the cache, which reduces the caching efficiency a lot.

Since we haven't dwelled on aliasing in this article yet, let's do it now.

In order to better understand what an aliasing conflict is, we will have to recall the structure of an n-way set-associative cache.

The easiest way is to imagine the cache as the 3D cube on the picture below. The number of layers (lines) in each set of the cache equals the number of banks, or ways (shown on the picture). A vertical slice of this cube is a set of cache lines with the same index.

Let's consider a real cache as an example. Take, for instance, the L1 data cache of the Northwood CPU with a total size of 8KB. We know that the L1 cache of this processor core is a four-way set-associative cache with 64 bytes of data in each line.

It is a set of four ways, each containing 32 sets (set1, set2, set3, ...). On the picture above each way is marked as Line 1, Line 2, Line 3 and Line 4. Each line is 64 bytes long. So, the total cache size is calculated as follows: 4 x 32 x 64 = 8,192 bytes, or 8KB.

When we need to find a certain address, the following happens. The cache controller logic takes the low-order address bits and uses bits 6 through 10 to select the set. Then, with the help of bits 11 through 15, it identifies and selects the way inside that set whose tag matches our request. As a result we get the desired element.

This is exactly where the first obstacle appears. Since we cannot process all bits of the address fast enough, we process only a part of them. But two different addresses with identical low-order 16 bits look identical from the cache controller's point of view. As a result, these two different addresses will keep evicting one another from the cache when we request them in turns.

This is what they call the aliasing problem.

In the Prescott core they managed to solve this problem in a very smart way. The set is still identified by bits 6 through 10, but the way is identified differently: now bits 11 through 21 are used. In other words, 11 bits are involved instead of 5. So, now aliasing only occurs for addresses that are 4MB apart, rather than 64KB apart as in Northwood.
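To make the bit arithmetic above more tangible, here is a small sketch that extracts the set index and the partial tags exactly as described: bits 6-10 select the set in both cores, while the tag compare uses bits 11-15 in Northwood and bits 11-21 in Prescott. The example addresses are arbitrary; the point is that two addresses 64KB apart collide in Northwood but not in Prescott, while addresses 4MB apart still collide in both:

/* Illustration of the partial-tag aliasing described above: two addresses
 * 64KB apart look identical to the Northwood L1 lookup (same set, same
 * partial tag), while Prescott's wider tag compare tells them apart. */
#include <stdio.h>
#include <stdint.h>

static uint32_t bits(uint32_t x, int lo, int hi)
{
    return (x >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
    uint32_t a = 0x00412340;            /* arbitrary example address   */
    uint32_t b = a + 0x10000;           /* 64KB away from a            */
    uint32_t c = a + 0x400000;          /* 4MB away from a             */

    /* Set index: bits 6..10 in both cores (32 sets of 64-byte lines). */
    printf("set(a)=%u set(b)=%u set(c)=%u\n",
           bits(a, 6, 10), bits(b, 6, 10), bits(c, 6, 10));

    /* Northwood partial tag: bits 11..15 -> a and b collide (aliasing). */
    printf("Northwood tag: a=%u b=%u -> %s\n",
           bits(a, 11, 15), bits(b, 11, 15),
           bits(a, 11, 15) == bits(b, 11, 15) ? "alias" : "distinct");

    /* Prescott partial tag: bits 11..21 -> only c (4MB away) collides. */
    printf("Prescott tag:  a=%u b=%u c=%u\n",
           bits(a, 11, 21), bits(b, 11, 21), bits(c, 11, 21));
    return 0;
}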

Moreover, the associativity of the L1 cache has also changed. Now it is an 8-way set-associative L1 cache. There are still 32 sets, and each line contains 64 bytes. The total size is therefore 8 x 32 x 64 = 16,384 bytes, or 16KB.

As a result, the cache access got somewhat slower (the latency is now 4 clocks instead of 2), but the aliasing problem has been almost completely eliminated. The performance penalties caused by aliasing became really rare. To be more exact, each penalty grew bigger, but they started to occur much more rarely.

It would make sense to assume that longer tags are one of the major reasons for the higher cache access latencies.

I also need to mention that 64KB aliasing is not the only type of aliasing in the Northwood core: there is 1MB aliasing as well. Its nature is certainly less evident; our best supposition is that the processing of bits 21-30 is postponed until the data arrives from the cache. In most cases this speeds up the overall processing, but sometimes it leads to collisions precisely because of this delayed processing of the remaining address bits.

In fact, aliasing is not something an L1 cache must inevitably suffer from. In particular, the K7/K8 CPUs do not have this problem, because they always check all bits of the requested address.

It’s time we took a closer look at the Pentium 4 processor pipeline and summed up everything we know about it.

Chapter VI: Pentium 4 Pipeline Architecture

Well, it's time we took a closer look at the Pentium 4 pipeline, especially since there is hardly any detailed information about it publicly available. What we know is that the pipeline of the Northwood processor (the part after the Trace cache) consists of 20 stages, while in Prescott it is even longer: 31 stages in total.

Let’s take a closer look at the Northwood pipeline and find out what exactly is happening at each stage:

  1. TC Nxt IP 1
  2. TC Nxt IP 2: at these two stages Trace cache finds those micro-operations the last executed instruction points at.
  3. TC Fetch 1
  4. TC Fetch 2: at these two stages up to 6 micro-operations are selected and sent to a special Fetch Queue, where the order of micro-operations strictly corresponds to the original code. If there is an MROM vector among these micro-operations, further reading from the Trace cache is temporarily halted, and the MROM vector in the Fetch Queue is replaced with the sequence of micro-operations it represents. As you remember, the Trace cache actually works at half the nominal frequency.
  5. Drive: micro-operations are moved towards a special unit called the allocator. At this stage micro-operations do not change, they simply travel along the pipeline. This stage is apparently necessary because at such high clock frequencies the signals simply cannot reach the required unit within a single clock.
  6. Allocator: here a special unit selects three micro-operations from the Fetch Queue and reserves the processor resources they need. Among these resources are positions in queues, register file entries and entries in the reorder buffer. The micro-operations prepared this way are then sent on to other queues, which we will call uopQ. Moreover, as I have just said, at this stage slots are reserved in the ROB (Reorder Buffer) for these micro-operations, which will later be used for instruction retirement.
  7. Rename registers 1
  8. Rename registers 2: at these stages the logical registers are mapped onto the actual physical registers. IA32 defines only 8 general-purpose logical registers, while the physical registers are much more numerous: 128. This mapping is necessary so that separate commands can be processed independently, without waiting for the required register to become free. The micro-operation order here still corresponds to the original program code.
  9. Queue: at this stage the micro-operations prepared by the Allocator are sorted out and arranged into special queues, uopQ. There are two uopQs. One of them is intended for address calculations, and another for all other uops. The micro-operations order corresponds to the program code. From the uopQs micro-operations are sent to the scheduler queues, schQ. We are going to pay special attention to uopQ and schQ later in this chapter.
  10. Schedule 1
  11. Schedule 2
  12. Schedule 3: at these three stages a lot of interesting things happen. First, schedulers receive micro-operations from the uopQs. Note that they strictly select the oldest micro-operations and retain their order, although from the two uopQs the micro-operations are selected independently. Besides, each scheduler receives a specific type of micro-operations and then they are placed into schQ. There are 5 schedulers in total. So, there are five schQs. The type of the given micro-operation determines the unique scheduler queue it can go to. From the schQ micro-operations are sent for execution through the issue or execution ports. There are four ports like that.
  13. Dispatch 1
  14. Dispatch 2: here micro-operations are getting ready for further execution and are provided with the corresponding operands. Then they will go to the execution units through the corresponding issue ports. The order of micro-operations is determined by their readiness for further processing and not by the initial program code.
  15. Register File 1
  16. Register File 2: at these stages the operands are read from the register file. As we have already said, the reading from this file can go at twice the core frequency because this is required for fast ALU functioning.
  17. Execute: for this very stage the entire above described structure has been actually created. It is at this particular stage that the received micro-operation with operands falls into its functional unit and gets executed. It is important to note that the result of this operation can be immediately “grabbed” by the operation waiting for it and at the same time written into the Register File (this is called bypass mechanism). Instead of the logical but very time-consuming chain of actions (first save the data to the register and then read the data from there), we immediately send the result to the unit waiting for it. This “immediate” data transfer can even happen between two different fast ALUs. In all other cases data synchronization or transfer can cause short delays (see Appendix 2).
  18. Flags: at this stage the flags required by the program are calculated and set. For example, such as zero result, positive result, readiness indicator, etc. In particular, these flags serve as the entry data for the next stage, where the branch predictions will be verified.
  19. Branch Check: at this stage the branch prediction unit compares the actual outcome of the just executed branch with what had been predicted earlier. If the prediction was wrong, the prediction algorithm is corrected accordingly. This way the branch prediction unit accumulates prediction accuracy statistics, which are then used to correct the prediction model.
  20. Drive: the result of the check performed at the previous stage is sent back to the front end of the pipeline.

The micro-operation is waiting for the retirement in order to free the resources it used and to have the final non-speculative result written into memory. Retirement is strictly sequential (according to the program code) and is performed over the same triplets of micro-operations, which have been formed at the earlier pipeline stages. Only one triplet per clock can be processed here. It is important that if one of the three micro-operations in the triplet is not ready yet, the system will be waiting for the execution to complete. The retirement is based on the data about the initial micro-operations order, which was written into the ROB at stage 6. This whole process is within the responsibility of the Retire Unit, which is located at the very end of the pipeline.

The picture below represents the principal pipeline schematics, which will help to illustrate the statements above:



An attentive reader has already noticed that some "new characters" appeared at many pipeline stages. These "characters" are various queues, of at least three types. Since we haven't paid special attention to this subject yet, let's discuss them in a little more detail now.

So, at stage 6 (Allocator) three micro-operations are selected from the only queue available at that point, the Fetch Queue. Then processor resources are reserved for these micro-operations. After that, the operations are placed into the two uopQ queues: one for address operations and another for all other operations. As you can see from our description of the pipeline, this distribution between the two queues happens at Stage 9.

The main task of the uopQ queues is to distribute micro-operations of different types between the different schedulers correctly. That is exactly why the uopQ for address operations accepts only two types of uops: "load [address]" and "store [address]". All other operations, including "store data", are placed in the other, main queue. The address queue is 16 micro-operations deep; the main queue is longer and can hold twice as many operations, i.e. 32. The micro-operations are placed into these queues strictly in program order: when one of the queues fills up, the other one is closed as well and stops accepting micro-operations. There are two practical advantages to this queue organization: "early" loading, and faster execution of short pieces of code that depend on the results of longer operations.
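A toy model may make this routing rule easier to picture. The sketch below is our own simplification (the uop type names are just shorthand): it routes "load address" and "store address" uops to a 16-entry address queue and everything else, including "store data", to a 32-entry main queue, and stops allocating as soon as the queue a uop needs is full, since the uops have to enter the queues strictly in program order:

/* A toy model of the two uopQ queues described above: load/store-address
 * uops go to the 16-entry address queue, everything else (including store
 * data) goes to the 32-entry main queue, and allocation stalls as soon as
 * the queue a uop needs is full. The type names are our own shorthand. */
#include <stdbool.h>
#include <stdio.h>

enum uop_type { UOP_LOAD_ADDR, UOP_STORE_ADDR, UOP_STORE_DATA, UOP_ALU, UOP_FP };

struct uopq { int used, depth; };

static bool allocate(struct uopq *addr_q, struct uopq *main_q, enum uop_type t)
{
    struct uopq *q = (t == UOP_LOAD_ADDR || t == UOP_STORE_ADDR) ? addr_q
                                                                 : main_q;
    if (q->used == q->depth)
        return false;                  /* allocation stalls: in-order fill */
    q->used++;
    return true;
}

int main(void)
{
    struct uopq addr_q = { 0, 16 };    /* address queue: 16 positions      */
    struct uopq main_q = { 0, 32 };    /* main queue: 32 positions         */

    enum uop_type program[] = { UOP_LOAD_ADDR, UOP_ALU, UOP_STORE_ADDR,
                                UOP_STORE_DATA, UOP_FP };
    for (unsigned i = 0; i < sizeof program / sizeof program[0]; i++)
        if (!allocate(&addr_q, &main_q, program[i])) {
            printf("stall at uop %u\n", i);
            break;
        }
    printf("address queue: %d/16, main queue: %d/32\n",
           addr_q.used, main_q.used);
    return 0;
}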

When the schedulers process micro-operations (at stages 10, 11 and 12) the uop-s get distributed among the next type of queues: scheduler queues.

Micro-operations leave each uopQ sequentially, according to the FIFO principle (first in, first out). The two queues, however, work independently of each other: for instance, a micro-operation from the address queue can be selected before all of the preceding operations have left the "main" queue. In many cases this allows data loading to start earlier, which can be quite helpful. However, there is another side to the coin: this independence significantly increases the probability of incorrect situations, such as an attempt to read data before it has actually been calculated by the corresponding micro-operation from the main queue.

The transfer rate here is pretty high: up to two micro-operations per clock cycle for each scheduler. This speed is valid not only for the fast schedulers (see below), but can also be achieved by the slow ones. The uopQ queues are sensitive to half-clock events (and thus they can also be considered units working at twice the clock frequency). If a micro-operation cannot be sent to its scheduler because the corresponding schQ queue is full, the transfer of all subsequent uops from the uopQ is halted. If a micro-operation can be sent to either of two suitable schedulers, the system chooses the scheduler depending on the status of their schQ queues.

There are five scheduler queues as well as five schedulers, as we have already mentioned in the pipeline description. Let me list all these schedulers now:

FAST_0 – works with ALU micro-operations: logical operations (and, or, xor, test, not); ALU store data; branch; transfer operations (mov reg-reg, mov reg-imm, movzx/movsx, lea simple forms); simple arithmetic operations (add/sub, cmp, inc/dec, neg).

FAST_1 – works with ALU micro-operations. Among them are transfer operations subsets (except movsx) and arithmetic operations subsets (except neg). It looks like all operations sent to FAST_1 can also be sent to FAST_0.

SLOW_0 – works with FPU micro-operations dealing with data transfer and conversion (for x87, MMX, SSE and SSE2 instructions), as well as FPU store data.

SLOW_1 – works with ALU- and FPU micro-operations: a number of simple ALU-operations (shift/rotate; some uop-s created by adc/sbb) and all complex uop-s starting with multiplication, as well as the majority of “computational” FPU-operations.

MEM – AGU-operations: load and store address.

It is evident that all operations from the uopQ address queue are sent to the MEM scheduler queue. All other micro-operations fall into one of the remaining four scheduler queues.

We have already mentioned that the fast ALUs are not the only units working at twice the nominal clock frequency. There are other processor units that also work twice as fast as the nominal clock. Among them are two schedulers: FAST_0 and FAST_1, each servicing its own fast ALU. This double speed refers to the selection and transfer of micro-operations. In some cases the unit receiving certain micro-operations, such as "ALU store data", cannot sustain this transfer rate. As a result, such micro-operations can be sent to the corresponding units only on every even tick.

The schQ queues, which correspond to the reservation stations (RS) of the P6 micro-architecture, allow micro-operations to be selected in a modified order. This is exactly the place where out-of-order selection of commands occurs.

The operations are sent for execution so that by the time they arrive, the execution unit is already free and the operands are available. If there is more than one operation meeting the requirements for a given schQ queue, then the preference will stay with the “older” one.

As you can see from everything we have just said, the schedulers are responsible for all the preparatory work that is essential for proper CPU functioning. In fact, the schedulers are the heart of the logic servicing the functional units of the CPU. In particular, the schedulers have to "feed" commands and data into the functional units so as to reduce their idle time to a minimum. However, to plan the work of the execution units efficiently, they need to know the schedule of micro-operation execution and operand availability. Most commands need a fixed, known amount of time to execute completely. The actual problems occur when the time of data delivery is hard to predict (especially when the data has to come from the memory).

Since each micro-operation has to pass a few pipeline stages on its way from the scheduler to the functional unit, the scheduler needs to calculate/predict the situation a few clock cycles ahead. Certain allowances are made for micro-operations that depend on ones whose execution time is variable (such as data loads, for instance). Let's bookmark this place: later on we will need to recall this scheduler peculiarity in order to better understand how replay works.

The schQ scheduler queues are of the following sizes: SLOW_1 – 12 positions, SLOW_0 – 10 positions, all the others – 8 positions each. These numbers define the maximum "window" of same-type instructions whose execution order can be changed. Let me stress it once again: the quality of out-of-order command execution depends on the length of the scheduler queues.

Once the schedulers completed their part of micro-operations processing, the micro-operations are sent to the execution units via four issue ports.

Port 0 is used by FAST_0 and SLOW_0; port 1 is used by FAST_1 and SLOW_1. Load micro-operations and store-address micro-operations are sent by the MEM scheduler to ports 2 and 3 respectively. Ports 0 and 1 receive uops from the fast schedulers every tick, and from the slow schedulers every second (even) tick. One port can process only one micro-operation per tick (half a clock), and if two schedulers conflict, the "older" micro-operation is preferred. The registers are read from the Register File at twice the clock frequency. The schedulers, the Register File, the issue ports and the fast ALUs all belong to the so-called Rapid Execution Engine.
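The assignment rules from the scheduler list and the port description above can be summed up in a small lookup. This is only a restatement of the text in code form: the uop class names are our own shorthand for the groups described earlier, not Intel's internal labels:

/* A compact restatement of the scheduler and issue-port assignment rules
 * described above, written as a lookup. The uop class names are our own
 * shorthand for the groups listed in the text, not Intel's labels. */
#include <stdio.h>

enum uop_class { SIMPLE_ALU, ALU_SHIFT_ADC, FP_MOVE, FP_COMPUTE,
                 LOAD_ADDR, STORE_ADDR };
enum scheduler { FAST_0, FAST_1, SLOW_0, SLOW_1, MEM };

static enum scheduler pick_scheduler(enum uop_class c, int fast0_busy)
{
    switch (c) {
    case SIMPLE_ALU:    return fast0_busy ? FAST_1 : FAST_0; /* both fast schedulers can take it */
    case ALU_SHIFT_ADC: return SLOW_1;   /* shift/rotate, adc/sbb uops, multiplication... */
    case FP_MOVE:       return SLOW_0;   /* FP/MMX/SSE moves and conversions, FP store data */
    case FP_COMPUTE:    return SLOW_1;   /* most "computational" FP uops */
    case LOAD_ADDR:
    case STORE_ADDR:    return MEM;      /* AGU operations */
    }
    return FAST_0;
}

static int issue_port(enum scheduler s, enum uop_class c)
{
    switch (s) {
    case FAST_0: case SLOW_0: return 0;
    case FAST_1: case SLOW_1: return 1;
    case MEM:    return (c == LOAD_ADDR) ? 2 : 3;
    }
    return -1;
}

int main(void)
{
    enum uop_class c = FP_COMPUTE;
    enum scheduler s = pick_scheduler(c, 0);
    printf("scheduler %d, port %d\n", (int)s, issue_port(s, c));
    return 0;
}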

We have just discussed the pipeline structure of the Northwood processor core. The pipeline of the Prescott core is known to be much longer: it consists of 31 stages. However, the new stages this pipeline acquired are mostly Drive stages, when the micro-operations or data are simply transferred along the pipeline without any specific processing.

Chapter VII: Hyper-Threading Technology

It looks like now is the perfect time to put in a word about the well-known Hyper Threading technology (HT). The main idea behind this technology is very simple: all processor execution units (or the majority of them) are almost never busy all at once during the program code execution. As a rule only one third of the available processor computational resources is occupied (according to Intel’s own estimates), which is actually quite humiliating, I should say.

So, it suddenly occurred to them: why not use the execution units that are free from working on the current program code to execute some other program, or another thread of the same program? In fact, this is a very reasonable idea, and not a new one: hardware multi-threading goes back to the 1960s, when the CDC 6600 designed by Seymour Cray used it in its peripheral processors (although that was not a single-die CPU); the modern form of the idea is known as Simultaneous Multi-Threading. So, I wouldn't say that Intel was highly original when they came up with HT technology. Nevertheless, we should give them due credit: Intel was the one to bring Hyper Threading technology to the PC market.

How does this technology actually work? The basic idea seems to be pretty clear, but how did they implement the simultaneous processing of several tasks?

In fact, it was all done in a very simple and logical way. A CPU with Hyper Threading technology is recognized by the operating system as two logical processors. So, only an operating system that can work correctly in multi-processor configurations will be able to work correctly with Hyper Threading. Each of the logical processors is assigned its own "architectural state" defined by the values of the logical registers, service flags and status registers. The "architectural states" of the two logical processors are different, and the logic managing them has to be able to handle the state of each processor separately. These are the blocks that had to be added to the CPU.
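For the curious, software can see this picture for itself: CPUID leaf 1 sets the HTT feature flag in bit 28 of EDX and reports the number of addressable logical processors per physical package in bits 23:16 of EBX. The sketch below uses GCC's <cpuid.h> helper; note that on later multi-core CPUs the HTT flag merely means "more than one logical processor per package", not necessarily Hyper Threading:

/* A quick check of what CPUID reports about Hyper Threading: leaf 1 sets
 * the HTT feature flag in EDX bit 28 and reports the number of addressable
 * logical processors per package in EBX bits 23:16. Uses GCC's <cpuid.h>. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 not supported\n");
        return 1;
    }
    int htt     = (edx >> 28) & 1;        /* Hyper Threading feature flag */
    int logical = (ebx >> 16) & 0xff;     /* logical CPUs per package     */

    printf("HTT flag: %d, logical processors per package: %d\n",
           htt, logical);
    return 0;
}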

All in all, the CPU can be split into two parts: two sets of "architectural state" descriptions for our logical CPUs, and a single core shared by both of them, which actually executes the commands. To simplify things, we could say that the units implementing Hyper Threading support belong to the Front End block of the processor.

So, the following units had to be added to the processor to guarantee Hyper Threading technology support (to be more exact, they had to duplicate a number of already existing units): Instruction Streaming Buffers, Instruction TLB, Trace Cache Next IP, Trace Cache Fill Buffers, Register Alias Tables. Note that all these extra units do not increase the die size that much. It gets less than 5% bigger. Take a look yourself:


the picture above presents Intel Xeon MP processor with 256KB L2 cache and 1MB L3 cache,
manufactured with 0.18micron technology

This way, most of the units added to the CPU are special buffers saving the info about architectural status of the logical processors.

Now the working principles get very easy to understand: different programs (or different threads of the same program) are simply sent to different logical processors. The only difference is that they are still executed in a single physical processor, so it looks like the biggest performance gain could only be obtained when the running programs use different execution units.

Of course, it is pretty logical to assume that two computational programs using the FPU unit will not be running any faster: we still have only one FPU unit, which will not be able to execute two tasks simultaneously. So, pure logics suggests that we should have different program threads use different processor resources, so that we could benefit from the multi-threaded processing.

But things are not that simple at all. In reality it may turn out that two threads with intensive floating-point calculations will be processed faster in HT mode than one after another, while a thread waiting for data from the memory may slow down another thread that isn't working with the memory at all.

Why would it happen like that?

Let's take a closer look at the example with two FP threads (the same holds for MMX and SSE, of course), for instance a simple iterative loop. As you know, each operation has a fixed execution time (latency). Say we have a multiplication here, FP_MUL, with a latency of 6 clocks. Having sent this command for execution, the scheduler will hold back all other commands depending on the result of this multiplication for at least 6 clock cycles, although the FPU will already be ready to accept the next FP_ADD command on the very next clock. If there are no such independent commands in the queue (and there are only commands of the FP_MUL type), the next clock cycle will be skipped. If there are no independent commands in the queue at all, the FPU will be idling for 5 clocks.

The second Hyper Threading thread will use these "empty" clocks for its own calculations, because the commands of the two threads are totally independent of one another. Of course, the average number of independent operations in the FP queue can also be increased within a single thread with the help of special optimization techniques (such as "de-looping", i.e. unrolling the loop into several independent chains). However, if you want to use the full potential of the FPU, you will need 5-6 independent chains with an equal share of FP_ADD and FP_MUL commands, all of which must have their data readily available in the cache. And this is far from a trivial optimization challenge for most algorithms.
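Here is what such "de-looping" looks like in practice, a deliberately simplified sketch (the constants are arbitrary, and the exact timings will of course depend on the compiler and the CPU). The first loop is one long dependent chain of multiplications, so the FPU spends most of its time waiting; the second keeps four independent accumulators, filling the very same idle clocks a second Hyper Threading thread could otherwise use:

/* An illustration of the latency problem described above. In the first
 * loop every multiplication depends on the previous one, so the FPU waits
 * out the full multiply latency each iteration. The second loop keeps four
 * independent accumulators, so independent multiplies can fill the idle
 * clocks -- the same idle clocks a second Hyper Threading thread can use. */
#include <stdio.h>
#include <time.h>

#define N (1L << 25)

static double seconds(clock_t a, clock_t b)
{
    return (double)(b - a) / CLOCKS_PER_SEC;
}

int main(void)
{
    double x = 1.0000000001;

    clock_t t0 = clock();
    double p = 1.0;
    for (long i = 0; i < N; i++)        /* dependent chain of multiplies */
        p *= x;
    clock_t t1 = clock();

    double p0 = 1.0, p1 = 1.0, p2 = 1.0, p3 = 1.0;
    for (long i = 0; i < N; i += 4) {   /* four independent chains       */
        p0 *= x; p1 *= x; p2 *= x; p3 *= x;
    }
    double q = p0 * p1 * p2 * p3;
    clock_t t2 = clock();

    printf("dependent: %.3fs   de-looped: %.3fs   (%f %f)\n",
           seconds(t0, t1), seconds(t1, t2), p, q);
    return 0;
}

Compile it without flags that let the compiler reassociate floating-point math (such as -ffast-math), otherwise the difference disappears.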

This simple example allows us to draw two somewhat paradoxical conclusions:

  1. The NetBurst execution units’ resources seem excessive at first glance, and their shortage should not affect the Hyper Threading efficiency. This is extremely illustrative for integer fast ALU units, which can execute more uop-s (up to four per clock) than the Trace cache can send (up to three per clock).
  2. The maximum performance gain from Hyper Threading compared with sequential thread processing is obtained in non-optimized applications. Optimization that increases the IPC of a single thread reduces the Hyper Threading efficiency. Moreover, if the threads compete not only for the execution units but also for other shared resources (caches, queues, buffers), then beyond a certain optimization point Hyper Threading can start harming the overall processor performance, and the two threads will be processed faster one after another than in parallel in the Hyper Threading mode.

But let’s return to our discussion of Hyper Threading technology implementation in the Pentium 4 CPU.

Hyper Threading technology is relatively easy to implement in the NetBurst architecture thanks to such specific features as the Trace cache. In a traditional architecture such as P6, the decoder is tightly coupled with the execution units. To process two instruction threads simultaneously, they would have to be transformed into micro-operations simultaneously, which is really hard to achieve. The worst part would be fetching instructions for both threads at the same time, which is a pretty sophisticated task since x86 instructions have variable length.

It is a completely different story when we have a Trace cache: it already stores decoded commands for different program threads (or different programs). So it is much easier to select the micro-operations for the appropriate threads: all you need is to add a service index to each micro-operation identifying the thread it belongs to. Especially since the execution core is not very closely coupled with the decoder and its functioning doesn't depend on the decoder directly (all we need is a sufficient amount of pre-decoded instructions in the Trace cache).

So, both logical processors work on the same physical core. As we have already mentioned, all core resources are split into two large categories: "shared" and "distributed". The so-called distributed resources include the Fetch Queue, the Uop Queue and the Scheduler Queues. For each logical processor the queue depth is smaller, because a part of its capacity is assigned to the other logical processor. In other words, these resources are split in two halves: one for each logical processor. Or, for example, there is a position in the queue which can be occupied only by the first logical processor, and another position which can be used only by the second one. In the case of shared resources, their actual distribution between the logical processors is arranged individually in each particular case. However, note that there is a special mechanism preventing one logical processor from using up an entire resource. You understand why this mechanism is necessary: if we have one fast and one slow (or stalled) command thread, the latter could theoretically occupy all the queues, thus blocking the execution of the first thread.

Therefore, there should be a certain algorithm which would allow distributing the queue capacity between the two processors in the most efficient way. How can two threads share the positions in the UopQ (or any other queue) with the fixed number of positions? There are two different ways: competitive way (when each of the threads tries to take as much of the resources from the opponent as possible) and fixed way (50:50, for instance). In the first case it is quite possible that one of the threads will slow down or oust the second thread from the queue completely. In the second case, the resources will not be used efficiently, if one of the micro-operations threads requires more than 50% of them: one of the threads will lack resources, while the other one will simply waste them.

In the Pentium 4 processor the queue capacity for each thread is fixed: each logical processor gets a Fetch Queue, Uop Queue and scheduler queues that are half as deep as those available with Hyper Threading disabled (or absent). This certainly has some negative influence on the performance of each logical processor, but it makes it impossible for either of the threads to block the processor. So here the resources are distributed absolutely equitably. An important note: the micro-operations move along the queue independently for each logical processor.
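A toy model of such fixed partitioning (the entry count and the code are purely illustrative, not the actual hardware layout) could look like this: a 32-entry queue where each logical processor may occupy at most half of the entries, so a stalled thread can never push its sibling out:

/* A toy model of the fixed partitioning described above: a 32-entry queue
 * where each logical processor may occupy at most half the entries, so a
 * stalled thread can never block its sibling. Sizes are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_DEPTH 32

struct part_queue {
    int used[2];                       /* entries held by each logical CPU */
};

static bool enqueue(struct part_queue *q, int logical_cpu)
{
    if (q->used[logical_cpu] >= QUEUE_DEPTH / 2)
        return false;                  /* this thread's half is full: it
                                          stalls, the other thread doesn't */
    q->used[logical_cpu]++;
    return true;
}

int main(void)
{
    struct part_queue q = { { 0, 0 } };

    /* Thread 0 stalls after 16 uops; thread 1 still gets all of its 16. */
    int ok0 = 0, ok1 = 0;
    for (int i = 0; i < 20; i++) ok0 += enqueue(&q, 0);
    for (int i = 0; i < 20; i++) ok1 += enqueue(&q, 1);
    printf("thread 0 allocated %d, thread 1 allocated %d\n", ok0, ok1);
    return 0;
}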

All other processor resources are shared in this scheme: the register file, the schedulers, the execution units, all the caches, the load units. Here the resources are used according to the competitive principle: first come, first served. Not without arbitration, of course: if both logical processors address the same unit, the arbiter enforces a strict processing order.

Let me offer you a schematic representation of a pipeline with the labeled distributed resources:


The blue color indicates the resources of one logical processor,
and the gray color – of another one.

And now let's find out what the arbitration looks like for the Trace cache. The Trace cache is an 8-way set-associative cache using an LRU replacement algorithm. The pre-decoded micro-operations in the Trace cache are accompanied by two independent arrays of "next instruction pointers", one for each logical processor. They show each logical processor where exactly the next micro-operation of its thread is located.

Both logical processors compete for Trace cache access every clock. If their requests come in simultaneously, access to the Trace cache is granted to each of them in turns, every other clock. In other words, the first one accesses the Trace cache on the first clock, the second on the second clock, then the first one again on the third clock, and so on. If one of the threads is stalled (or there are no decoded micro-operations in the Trace cache for this thread), the second thread has the Trace cache at its full disposal.

A similar access scheme is used when the Trace cache contains a link to the Microcode ROM: Microcode ROM accesses are arbitrated in the same way as Trace cache accesses.

If the micro-operation the next-instruction-pointer points to is not in the Trace cache, the request has to go to the L2 cache, and when the x86 command arrives from the L2 cache, it has to be decoded into micro-operations. While obtaining the next x86 instruction, we have to translate the virtual address indicated by the next-instruction-pointer into a physical address (because the L2 cache works with physical addresses only). A special Instruction Translation Lookaside Buffer (I-TLB) is used for this purpose. It receives the request from the Trace cache, translates the virtual address of the instruction into the physical one and sends the request to the L2 cache. In most cases the requested command is found in the L2 cache. It is then moved to a special streaming buffer of the corresponding logical processor (2 lines of 64 bytes each) and stays there until it is completely decoded and transferred to the Trace cache.

Each logical processor has its own I-TLB table, i.e. this unit was duplicated when they implemented Hyper Threading support (the processor core scheme shown in the beginning of this chapter demonstrated exactly this additional I-TLB table). Moreover, each logical processor features its own set of next-instruction-pointers, which help to monitor the decoding of x86 commands.

The cache requests of the two logical processors are arbitrated according to the FIFO principle (the earlier request is processed first, all requests are processed in order of appearance).

The branch prediction unit is partially shared and partially duplicated. The return stack buffer is duplicated, because it is a small unit and, moreover, the call/return addresses have to be tracked for each logical processor individually. The branch history table is also duplicated, because the prediction history has to be monitored per logical processor, too. The global branch table, however, is a shared unit in which each branch entry is tagged with the index of the logical processor it belongs to.

The instruction decoder has to process x86 commands for both threads, even though only one x86 instruction can be decoded at any given moment (and only for one logical processor, of course). Moreover, the decoder can process several x86 instructions before it switches to the buffer of the second logical processor. In other words, the decoding window for each logical processor can actually be more than one command, i.e. the decoder doesn't have to switch between the two logical processors every clock cycle. In this respect the decoder is shared in a completely different way than the Trace cache. We don't know the exact size of the decoding window, but it seems logical to suppose that it is determined by the number of x86 instructions in the streaming buffer of the logical processor. It is this internal structure that lets the decoder avoid switching between the logical processors every clock cycle.

It is also very important to pay special attention to the way the resources are shared inside the CPU core. Let’s now discuss the Back End block and take a look at the corresponding part of the pipeline scheme:


Some terms on this picture differ from those we have been using.
Nevertheless, it is a very illustrative picture, so we will specifically point out
where our markings or terms differ from those on the picture.

As we remember from the previous chapter, out-of-order commands execution begins at schQ stage (marked as Shed on the picture), and finishes at the Register stage (marked as Register Write on the picture). So, what are the peculiarities here?

The Allocator, one of the key logic units, which we have already discussed in detail in the previous chapter, has the following resources at its disposal: 126 entries in the ROB (Reorder Buffer), 48 entries in the load buffers and 24 entries in the store buffers. It also has 128 integer registers and 128 floating-point registers.

The maximum resources available to each of the two logical processors are limited. Each logical processor can occupy up to 63 ROB entries, up to 24 load buffer entries and up to 12 store buffer entries. This limitation is imposed to prevent either of the logical processors from taking over all the resources.

When there are micro-operations for both threads in the Fetch Queue, the allocator will be switching between the logical processors every clock allocating resources for them in turns.

If one of the two logical processors lacks some resource (for example, free store buffer positions), the allocator generates a "stall" signal for this processor and continues allocating resources for the second one. Likewise, if the Fetch Queue contains micro-operations for only one logical processor, the allocator will give resources to this processor every clock in order to maximize its performance. However, the maximum resource capacity available to a single logical processor remains limited, to prevent the CPU from being blocked by one of the threads.

Another duplicated unit shown on the picture in the beginning of this chapter is the Register Alias Table (RAT). It maps the 8 architectural registers onto the 128 physical ones. Each logical processor has its own RAT, and its contents are an inalienable part of the architectural state of that logical processor.

The schedulers, the heart of the out-of-order execution system, work with the logical processors in a slightly different way. It doesn't matter to them which logical processor a micro-operation belongs to. The schedulers can send up to 6 micro-operations for execution within a single clock cycle. These can be, for example, a pair of micro-operations from each logical processor, or three uops from one logical processor and one uop from the other. There is still one limitation here: all positions in the queue of a given scheduler cannot be taken by a single logical processor, which again prevents one logical processor from capturing all the resources and blocking the other.

The execution units and the memory subsystem are shared between the two logical processors without any serious limitations. These units do not care which logical processor owns this or that micro-operation.

DTLB (table for translation of virtual addresses into physical ones, only for data this time) is also shared. Each record in it is marked with the identification index of the logical processor.

The caches of all levels are also shared by the two logical processors. There are no limitations, and the origin of the requests is not monitored.

Of course, resource sharing is not free from certain problems. As we have already seen, as a result of competition between the two logical processors for the shared resources, each thread gets fewer resources than a single thread would have. In particular, the Northwood processor has a relatively small L1 data cache of only 8KB; with Hyper Threading enabled, the effective L1 data cache size per thread is almost halved, to about 4KB. By the way, this pushed us to the idea that the improved Hyper Threading efficiency of the new Prescott core comes not only from the enhancements made to the technology itself, but also from the twice as large L1 data cache, which has certainly reduced the performance losses caused by a too small L1 data cache.

To tell the truth, all of us consider this idea pretty logical.

Anyway, since the two threads are processed simultaneously, the processing speed of each individual thread is generally lower. In particular, it takes more time to complete the main thread if a background thread is being executed at the same time. Nevertheless, in most cases it still takes less time to execute both threads simultaneously than to execute them one after another. In other words, with a second thread present, the first thread runs slower, but the overall throughput grows, because both threads together finish sooner than they would one by one.
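To put some purely hypothetical numbers on it: if each of two threads takes 10 seconds when run alone, executing them one after another takes 20 seconds; under Hyper Threading each of them might take, say, 14 seconds, yet both are finished after those same 14 seconds. Every individual thread became slower, but the total execution time dropped.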

Software optimization is a very important issue connected with Hyper Threading technology. Indeed, if the active thread requests some data from memory but doesn't free the execution units while waiting, the second thread will not be able to continue its work, while the first thread will not have its data either. Moreover, even in a single-threaded case, overflowing the queue of a single scheduler is enough to stop commands from reaching any other scheduler, even if it is absolutely free. That is why the quality of code optimization is one of the major factors in Hyper Threading efficiency.

Besides that, the operating system should work correctly with Hyper Threading technology. In the Windows family this means Windows XP (and newer OSes). Older operating systems such as Windows 2000 cannot distinguish between physical and logical processors, although they can still work in such configurations.

Let's sum up a few things now. Hyper Threading is based on the idea of increasing processor efficiency by sharing the processor units between different logical CPUs, thus reducing their idle time. Hyper Threading can provide from 0% to 30% performance improvement, and in some cases it can cause slight performance drops. The efficiency of Hyper Threading technology depends a lot on the quality of software optimization. It cannot replace dual-processor configurations, but the user doesn't actually pay that much for a Hyper Threading capable CPU either (the die doesn't get much bigger).

From the technological point of view, Hyper Threading is not too complicated to implement. However, the logic managing the entire processor structure has to work in a much more complex way. Moreover, without Hyper Threading in mind the CPU could have looked completely different.

All in all, Hyper Threading technology should be considered a worthy addition to the NetBurst micro-architecture. In most cases it does improve the processor performance, which is the first criterion of effectiveness for any innovation.


To be continued!