by Victor Kartunov
05/30/2005 | 01:32 PM
Please check out the first article from the trilogy called Prescott: The Last of the Mohicans? (Pentium 4: from Willamette to Prescott) here.
No matter how exciting the mysteries of Pentium 4 performance are, the article cannot be endless. Still, it is high time we described something we discovered during our investigation of the Pentium 4 micro-architecture. In fact, this discovery is exactly the reason why we decided to write this article. It turned out to be a rather mysterious thing, both structurally and as far as the official documentation is concerned.
This is how the whole thing happened.
A while after we started working on this article, when we had a draft write-up of its first seven chapters, it seemed that the major traits of the Pentium 4 micro-architecture were already well known to us. We were very happy about the article being almost finished and put all our efforts into polishing the small details: checking cache latencies and comparing the results with what the documents stated. Although we didn’t question the data in the official papers, we had our own measurement techniques developed while working on the previous article about the Athlon 64/Opteron micro-architecture, so we really longed for a fair comparison. All the more so because, while writing the article called “AMD: Per Aspera Ad Astra”, we had noticed that the Pentium 4 processor behaved rather weirdly: the results of the L2 cache latency measurements didn’t make any sense. We had to find an explanation for this phenomenon.
The Northwood-based Pentium 4 processor was tested with a chain of dependent instructions such as mov eax, [eax] (so-called pointer chasing). Theoretically, everything was supposed to be predictable here: according to the documentation, the L1 cache latency equals 2 clocks and the L2 cache latency equals 7 clocks. In other words, we expected to get a total latency of 9 clocks.
Our reaction to the actual results can best be described with the phrase “struck dumb”. The problem arose where no one would have expected it. Instead of the firmly stated latency value from the white paper, we were facing something unbelievable that looked more like the cardiogram of a heart patient.
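The access pattern of such a dependent chain can be pictured with a small sketch. This Python version only illustrates the idea, assuming that every element stores the index of the next element to visit; the names build_chain and chase are hypothetical, and timing interpreted Python says nothing about hardware cache latency. A real measurement would issue the native load instruction in a tight loop.

```python
import random

def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm produces a single cycle covering all n slots,
    so the walk cannot fall into a short loop.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # j < i keeps the permutation cyclic
        chain[i], chain[j] = chain[j], chain[i]
    return chain

def chase(chain, steps, start=0):
    """Follow the chain: each load address depends on the previous load."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]              # analogue of: mov eax, [eax]
    return idx
```

Because every step needs the result of the previous one, the processor cannot overlap the loads, which is what makes the pattern suitable for latency measurements.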
First of all, we never saw anything similar to the expected (and claimed by the documentation!) 9 clocks. The CPU managed to somehow generate tens of clocks of latency time instead.
It seemed that we had to undertake pretty evident measures to find out where this whole show was coming from: check the white-paper again. But, it was not that simple. There was no mention of any phenomena like what we were witnessing. The optimization guides also didn’t contain any answers to our numerous questions.
Moreover, this behavior of our CPU made us ask ourselves: did we really study all the principal subsystems of the Pentium 4 micro-architecture carefully enough? Maybe there is some important subsystem, which hasn’t been described yet? Does this effect we have just discovered affect the processor performance in any way? And if it does, then how big is this influence? Will we see the same in any real applications at all?
Well, unexpected challenges have never scared us away, I should say. So, our entire team got down to developing and checking various hypotheses. Unfortunately, for all their brilliance these hypotheses didn’t help, and after a while we gave up our desperate attempts to explain the situation described above in a reasonable way using the white paper data we had at our disposal.
It “just worked” this way. But we wouldn’t buy it then.
Besides digging through the whole bunch of technical data, we also focused on other “traditional” ways of obtaining information: blackmailing, flattering and bribing the top Intel executives :) (Just kidding!) No luck. All these time-tested opportunities didn’t work. So, we had to start thinking real hard and even write special software for our testing needs ourselves.
Besides, we thoroughly searched all the available documentation looking for at least a hint of any possible explanation. And we managed to find this hint (a very brief one, I should say). It was the word “replay”, which we came across several times in the IA-32 Intel® Architecture Optimization Reference Manual. Moreover, this word was also mentioned a few times on some slides in Intel’s early Pentium 4 presentations.
Since the documentation on Pentium 4 architecture didn’t contain any additional information about this mysterious “replay”, we started looking for its description in Intel’s patents.
Here I would like to omit all the details of our endless search, but I have to admit that the idea to study carefully Intel’s patents turned out very fruitful in the long run. It is there that we found a pretty abstract description of a special system for repeated micro-operations processing aka replay. This system is intended for repeated execution of micro-operations that have already left the scheduler but were executed incorrectly for some reason.
This way, we revealed the details of a mysterious sub-system of the Pentium 4 processor, which has hardly been described in any Pentium 4 documentation, or in any articles and reviews devoted to this processor’s micro-architecture. Well, this was a very unexpected windfall! Especially keeping in mind that we hadn’t planned on going really that deep into details at this point.
Of course, we were very curious to find out how this system affects the processor performance. Moreover, it is always a challenge to investigate the features and peculiarities of a subsystem that hasn’t been known to the general public before. We should finally be able to explain the weird behavior of our CPU, anyway.
So, we were really excited about getting to grips with this mysterious replay thing. And I have to confess that we managed to achieve some really impressive results. I would like to stress that this article is the world’s first detailed discussion of the replay feature and its functioning peculiarities. Of course, the first thing we supposed was that replay would be the key to the observed deviation of the actual processor performance from the theoretical one.
Later we arrived at the following conclusion: it looks like we found much more than we were actually looking for. This Replay turned out very interesting as a not very well-known feature of the Pentium 4 micro-architecture which clarifies some of its mysteries as well as affects the processor performance.
Moreover, when we investigated the Replay function, we also had to dwell on another very rarely mentioned subsystem called internal event counter. However, since this subsystem is not directly connected with the topic of our today’s article and can be of interest only to some technical specialists, we decided to provide all the details about it in Appendix 3. If you would like to skip additional details discussed in Appendix 3, you can go directly to Chapter IX now.
There is one topic which we haven’t yet discussed. Of course, there was no “cruel intention” behind this, we simply didn’t get a chance to talk about it. However, we cannot completely omit this important matter, so let’s say a few words about it now. This Appendix will be devoted to the processor events and events counters.
Events are actually everything that happens inside the CPU. The micro-operation was sent for execution to fast ALU – this is an event. The data has been requested from L2 cache – this is an event, too. The operands for the micro-operation sent to the FPU have arrived – this is also an event. The data is absent in the L1 data cache – is certainly an event.
So, whenever any of the processor units does something, this counts as an event. Of course, events differ in importance and priority. For example, the absence of the data in the L1 data cache implies a succession of interconnected actions: an L1 cache miss signal is generated, an L2 data request is sent, the micro-operation missing the data is processed in a specific way, and the next micro-operation is sent to the execution units in the meanwhile (if there is an independent micro-operation that has everything necessary for correct execution).
So, if there were a beautiful way to register all important events, we could get very detailed information about what’s actually happening inside the processor.
The processor designers have actually been thinking about it for a long time, too. The information about the processor units functioning is crucially important. True, how can we debug and polish off the branch prediction algorithms if we have no idea how many times the processor made an incorrect prediction and what type of predictions turned out the most difficult? How do we find out if the processor needs a larger cache if we don’t know the cache miss statistics? Does the CPU have enough execution units? Or maybe we could do with fewer execution units, because most of them are idling anyway?
In fact, the processors from different manufacturers (AMD, Intel) have had this special counter system for quite a long time. And its major task is to monitor all major events inside the processor.
The number of these counters is surprisingly big: there are a few hundreds of them! In fact, this number doesn’t seem impressive any more if we think how many events we should actually monitor to ensure proper functioning of the processor. Or if we decide to check the status of all major processor units at a given moment of time. In other words, we need to have at our disposal a huge number of parameters, if we are undertaking some serious investigation here. And all these parameters are monitored by counters.
There are special service registers storing the reports about various events. Any program analyzing the status of the processor functioning may send a request to these registers if necessary.
Moreover, both Intel and AMD have programs like that. Intel’s software is called VTune, and AMD’s is called CodeAnalyst. The major task of such a software tool is to analyze the processor operation and give some recommendations to the code developers. This actually means that there are software emulators of the corresponding processors integrated into these programs.
Of course, VTune uses event counters to evaluate the efficiency of this or that code.
However, things are not that simple here. The thing is that a lot of counters in VTune are undocumented. To be more exact, only a small part of the counters is documented (and it is really hard to call a counter documented if we only know its name). In reality, all the documentation does exist inside the corporation. But unfortunately, it is not available to the public. I doubt that Intel’s division working on VTune development didn’t know the purpose of these counters. But the truth is that most users have no idea what these counters actually do.
To tell the truth, we were very excited about this great number of counters that we discovered. It made us think whether we really understand correctly what’s going on inside Pentium 4 processor. The number of counters we found turned out somewhat too big for the events we knew about.
Maybe there is something else inside the processor that we didn’t see?
Let me answer this question now: yes, there is. And there are quite a lot of things, actually. The ongoing chapters of our article will be devoted to these particular things. And in the meanwhile, let’s return to our event counters.
As we have already said, we discovered a lot of undocumented counters. So we asked ourselves: what do they actually do?
So, we decided to check their behavior during the execution of special code we developed. If you have a vivid imagination, you can get an idea of the fun we had comparing the registers (checking whether the stored value had increased by one or not) and figuring out whether each counter increase corresponded to what we knew about the Pentium 4 architecture.
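The comparison routine itself is simple and can be sketched as follows. This is only an illustration under stated assumptions: read_counters is a hypothetical callable standing in for whatever mechanism actually reads the counter registers (on real hardware this needs privileged access), and the counter names in the fake bank are made up.

```python
def counter_deltas(read_counters, run_test_code):
    """Return the counters whose value changed while the test code ran.

    read_counters: hypothetical callable returning {counter_name: value}.
    run_test_code: callable that executes the code fragment under study.
    """
    before = dict(read_counters())
    run_test_code()
    after = dict(read_counters())
    # Keep only counters that moved, together with the size of the change.
    return {name: after[name] - before[name]
            for name in before
            if after[name] != before[name]}


# Stand-in for real hardware: a fake counter bank (names are made up).
fake_bank = {"known_event": 0, "mystery_event": 0, "idle_event": 0}

def fake_test_code():
    fake_bank["known_event"] += 1
    fake_bank["mystery_event"] += 3
```

Calling counter_deltas(lambda: fake_bank, fake_test_code) reports only the two counters that moved, which is the kind of filtering one needs when hundreds of counters are being watched.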
So, we managed to check the operation of documented counters: they measure understandable parameters and their functioning corresponds to what the whitepapers say.
However, another pretty big group of counters measured some unknown parameters. We managed to find out what that actually was a bit later, when we had discovered something new (see Chapter IX for details).
Before we pass over to the discussion of the freshly revealed phenomenon called replay, we suggest revising a few important things. It will really help us later.
First of all, you all know that the performance of an abstract CPU depends a lot on the density of the instruction flow fed into its execution units. Ideally, the execution units should perform some useful work every single processor clock. So, the primary goal of the pipeline control logic is to load the processor execution units to their maximum capacity.
In our particular case, in NetBurst micro-architecture, it means the following: the scheduler loading the micro-operations into the execution units should do everything possible to reduce their idling. Remember this conclusion, it will help us later.
This was an introduction. Now let’s approach the topic of our discussion from a bit different side. Let’s imagine a pipeline, where the scheduler is located right in front of the execution units. So, when the scheduler sends a micro-operation for execution, it grabs the corresponding operands and gets processed. Everything is beautiful! If the next micro-operation needs the results of the previous one, it will be able to grab them as soon as they are ready. If the micro-operation needs the results of the previous command as its operands, the scheduler will be able to send it to the execution units after a certain time interval. This time interval is determined by the time the first operation requires for execution (to be more exact, for the result of this execution to be ready for further processing).
Everything is just fine if the scheduler is right next to the execution unit. But, as we remember, one of the key goals of the NetBurst micro-architecture was to increase the processor working frequencies. As a result, the pipeline got longer, and a few more stages appeared between the scheduler and the execution unit. They are no longer next to one another.
Actually, this is not such a big problem: the micro-operations can be sent for execution in advance, taking into account the additional pipeline stages that appeared on the way. In other words, the scheduler should send out the micro-operation a few clock cycles earlier, to cover the distance between itself and the execution units right in time. It means that the commands should be sent out in advance, before the result of the previous micro-operation is available.
What’s the result of this measure? When the scheduler releases the first micro-operation, it will select the next one from the queue on the second clock cycle already. After the second micro-operation is sent along the pipeline, it will tackle the third one. And so on and so forth: the scheduler releases micro-operations onto the pipeline and the execution units execute them, which keeps the entire pipeline busy. By the time the first micro-operation is about to be executed, the following micro-operations are already approaching, occupying different pipeline stages, and the scheduler is already busy working on the next micro-operation.
This way we managed to achieve maximum processing speed for the micro-operations: our execution units process a micro-operation every considered time interval. Ideally, there should be no idling in this case and the performance of this entire structure will be the highest.
But let me stop here for a while and discuss the release of commands “in advance” in a bit greater detail. Note that the scheduler sends the micro-operation for execution so that by the time it arrives at the corresponding unit all operands have already been calculated. Since it takes a few stages (clock cycles) before the operation arrives at the unit, the scheduler should be able to estimate the readiness of operands a few clocks ahead. Here it should also take into account the time it takes to execute the previous micro-operations, if their results are taken as operands for the next micro-operations. If we have an operation of fixed latency (which means we know it in advance), the task can be solved in no time. However, there are certain instructions for which you cannot predict how much time their execution is going to take. For example, when we load some data from the memory, the time we need to complete this process will depend on the hierarchical level of the cache/memory subsystem our data is stored in.
This way, the scheduler splits all micro-operations in two groups: micro-operations with known execution time (fixed latency) and micro-operations with unknown execution time (variable latency).
Of course, the first group of micro-operations doesn’t hold any unpleasant surprises for the scheduler: if an ADD operation requires one clock cycle to be executed, then the result of this addition will already be available on the next clock. So the next operation can be sent to the execution unit one clock later, and our pipeline gets loaded in the most efficient manner.
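For fixed-latency micro-operations the scheduler's bookkeeping reduces to simple arithmetic, as the following sketch shows. It assumes a chain in which every micro-operation depends on the previous one and a 6-stage distance between the scheduler and the execution unit (the figure used later in this article); dispatch_times is a hypothetical name.

```python
def dispatch_times(latencies, distance=6):
    """Dispatch, arrival and result-ready clocks for a dependent uop chain.

    latencies[i] is the known execution time of uop i in clocks.
    distance is the assumed number of stages between the scheduler
    and the execution unit.
    """
    schedule = []
    dispatch = 0
    for latency in latencies:
        arrival = dispatch + distance
        ready = arrival + latency      # the result becomes available here
        schedule.append((dispatch, arrival, ready))
        dispatch += latency            # the dependent uop leaves this much later
    return schedule
```

For three one-clock operations this yields dispatch clocks 0, 1 and 2, and each micro-operation arrives at the execution unit on exactly the clock when its predecessor's result becomes ready.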
When we have a micro-operation of the second type defined above, the scheduler has a few options that allow it not to halt the pipeline. Say, we have a command to load some data from the memory.
First option (carefully straightforward). Suppose that we will keep in mind the worst possible instruction execution result. Here we do not consider such hopeless options as waiting for the data to arrive from the swap-file, which will take millions of clock cycles, or when the data is located very far away, say, in the RAM, which will take hundreds of clock cycles. In our example we will have the data in the L2 cache. In fact, this supposition may look unreasonable to you: why, on earth, do we need the L1 data cache with its low latency, if we don’t take advantage of this low latency? This strategy is already a failure, but we will still evaluate what this is going to cost us, once it happens.
Ok, the data is in the L2 cache. Say the distance from the scheduler to the execution unit is 6 stages (which the micro-operation will pass in 6 clock cycles respectively).
At time point [0] we received the micro-operation. The data from the L2 cache will arrive at the time point [0 + L2 cache access latency]. The Northwood core features a 9-clock L2 cache access latency in the general case (to be more exact, it equals 7 clocks, but the “data load” command first checks if the data is available in the L1 data cache, which requires 2 additional clocks). So, the scheduler will send out the next micro-operation so that it arrives at the execution unit 9 clocks later.
In fact, this option will hardly work for us, as it takes 9 clock cycles to execute only one micro-operation. We will not accept this scheduler strategy, because it definitely is not the right way to high performance.
Second option (upon agreement). The idea is to delay all micro-operations depending on the results of the data load command until the data arrives, and then start sending micro-operations for further execution. The good thing about this strategy is that it doesn’t require any additional effort: sit and wait for the data. The negative side of it is that it doesn’t always guarantee good performance in the long run.
If we had a micro-operation of the second type, the scheduler could take into account the info about its execution status from the execution units. In this case the scheduler would need to receive feedback from the execution units about the estimated execution time for the given instruction. In fact, this is quite possible (that this strategy is applied to the FPU load), however, there is one unpleasant issue.
Suppose that we were really lucky and the data is available in the L1 cache. On the Northwood core, the data will take two clock cycles to be delivered from the L1 cache.
Say, the execution unit received a micro-operation at the [0] time point. At the [0+2 clocks] point it received the data from the L1 cache and immediately reported this to the scheduler, which at once released the next micro-operation onto the pipeline. This micro-operation will take 6 clock cycles to reach the execution unit.
Everything seems to be correct, but what have we got in the end? Let’s sum up the results: our second micro-operation will reach the execution unit at 0+2+6 clocks, because it still needs to pass all the stages between the scheduler and the execution unit: the distance between them hasn’t got any smaller. It means we need 8 clock cycles in total. It turns out that the dependent instruction reached the execution unit not at the [0+2 clocks] time point, when the data was already there, but at [0+2+6 clocks], i.e. 6 clock cycles later. In other words, we lost 6 clock cycles!
Well, this is not the best option, I should say.
Moreover, we can easily prove that the efficiency of this strategy will reduce in general as the pipeline grows longer. As you have just seen, we got 8 clock cycles instead of 2 for the pipeline with 6 stages between the scheduler and the execution units. The resulting efficiency in this case equals 25%.
For a pipeline with only one stage distance between the scheduler and the execution unit the efficiency will increase to 67%.
For a pipeline with 666 stages we will get 668 clock cycles instead of 2. The efficiency is 0.3%.
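The three figures above follow from one ratio: the latency of the operation divided by the latency plus the scheduler-to-unit distance. A minimal sketch (the function name is ours):

```python
def wait_strategy_efficiency(latency, distance):
    """Useful share of time when the dependent uop is released
    only after the producer's data has actually arrived."""
    return latency / (latency + distance)

for stages in (6, 1, 666):
    share = wait_strategy_efficiency(2, stages)
    print(f"{stages:3d} stages: {share:.1%}")
```

With a 2-clock L1 hit this gives 2/8 for 6 stages, 2/3 for one stage and 2/668 for 666 stages.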
At the same time, if the instruction takes longer to execute, this strategy may actually work much better. Say, for instance, that our pipeline features a 6-stage distance between the scheduler and the execution units, but the instruction in question takes 50-100 clock cycles to execute (depending on the circumstances). Moreover, we do not know the exact execution time from the very beginning, but only after about 25 clocks.
The execution unit received the micro-operation at the [0] time point.
At the [0+25] time point the execution unit learns that it will take 51 clocks to complete the operation processing.
At the same time point ([0+25]) the scheduler receives the same information. It waits for a while and …
At the [0+45] time point it sends out the dependent micro-operation, which will reach the execution unit exactly at…
The [0+51] time point, when it finds the just obtained result of the previous micro-operation waiting for it.
In other words, there are combinations of pipeline length, micro-operation latency and the time at which this latency becomes known that make this strategy truly efficient.
This strategy is 100% efficient, when [the distance between the scheduler and functional units] is smaller than the difference between [the micro-operation latency] and [the time the latency becomes known].
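This condition can be checked with a few lines of arithmetic. The sketch below is our own simplified model, not a description of the actual hardware: the producer reaches the execution unit at clock 0, the scheduler learns the latency at clock known_at, and it then releases the dependent micro-operation as late as possible but never before it knows the latency.

```python
def feedback_schedule(latency, known_at, distance=6):
    """Return (dispatch, arrival, lost_clocks) for the dependent uop.

    latency:  clocks the producer needs, counted from its arrival at the unit.
    known_at: clock at which the scheduler learns that latency.
    distance: stages between the scheduler and the execution unit.
    """
    ideal_dispatch = latency - distance       # would arrive exactly on time
    dispatch = max(known_at, ideal_dispatch)  # cannot leave before latency is known
    arrival = dispatch + distance
    lost = arrival - latency                  # clocks the result sat unused
    return dispatch, arrival, lost
```

For the long operation, feedback_schedule(51, 25) gives a dispatch at clock 45, an arrival at clock 51 and no lost clocks. For the L1 hit, feedback_schedule(2, 2) gives an arrival at clock 8 and 6 lost clocks, matching the example above.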
The integer operations do not meet this condition, which is why this strategy doesn’t work for us here.
Third option (optimistic). From the performance point of view, the two previous options we have just discussed are not so interesting for us. The first option is awfully stupid, and the second option is too inefficient. There is only one more option left: to send instructions in advance before we know the execution status of the previous micro-operations.
Let me describe this option in a bit more detail.
The commands can be released one after another hoping for the best in terms of data loading outcome. In our case it will mean that 2 clock cycles after the data load from the memory occurs, the next micro-operation should already be sent. How can we benefit from this strategy?
At the [0] time point we send the data load micro-operation to the execution unit. It should reach this unit at the [0+6] time point and the scheduler knows about it.
Without waiting for this particular time point, the scheduler releases the next micro-operation at the [0+2] time point (i.e. two clocks down the pipeline from the previous command). What happens next? At [0+6] time point the data load command reaches the execution unit. The next command depending on it is 2 clocks behind. At [0+6+2] time point the data load command receives data from the cache and continues its trip down the pipeline, and the execution unit receives the second micro-operation right in time, by the time the result is ready. So, it turns out that the execution unit works two clocks in a row without pausing.
This is how micro-operations could be sent out “in advance” basing on the data load forecast. This allows loading the execution units with work in the most efficient way.
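The timeline of the optimistic strategy can be written down explicitly. The sketch below uses the numbers from the text (6 stages, 2-clock L1 latency) and assumes that the load hits the L1 cache; the function and key names are ours.

```python
def optimistic_schedule(distance=6, l1_latency=2):
    """Clock-by-clock landmarks for a load and its dependent uop,
    assuming the scheduler bets on an L1 hit and the bet pays off."""
    load_dispatch = 0
    load_arrival = load_dispatch + distance     # load reaches the unit
    data_ready = load_arrival + l1_latency      # L1 delivers the data
    dep_dispatch = load_dispatch + l1_latency   # released without waiting
    dep_arrival = dep_dispatch + distance       # dependent uop reaches the unit
    return {
        "load_arrival": load_arrival,
        "data_ready": data_ready,
        "dep_dispatch": dep_dispatch,
        "dep_arrival": dep_arrival,
    }
```

With the default values the dependent micro-operation arrives at clock 8, the very clock on which the data becomes ready, so no clocks are lost.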
So, if the distance between the scheduler and the execution units is quite big, only the optimistic strategy can load this long pipeline with work efficiently enough. It is important, however, that the scheduler:
Well, everything seems to be turning out quite nice. We have finally found the best strategy for the long pipeline of the NetBurst micro-architecture, which allows maintaining high execution rate for the processed micro-operations. And this is really the way we have just described it. Only one important condition has to be fulfilled in this case: the instruction is really executed in the best way. In the example above where we considered data loading from the memory, the “best way” for us will be if the data is in L1 cache. This optimistic forecast has the right to exist for two reasons. Firstly, the probability that the requested data is there is very high. And secondly, data transfer from this cache takes minimum time, so the waiting will also be minimal if the data loads successfully.
But if this condition is not met, then our nicely polished mechanism turns into a trap. Let me explain.
Imagine the following situation. The scheduler released a succession of four micro-operations dependent on one another. They are all on the way to the execution unit, but before the data has been loaded. Everything happens according to the optimistic strategy of the scheduler.
Here they are approaching the execution unit. And then - oh no! - it suddenly turns out that the requested data is not in the L1 cache: instead of the long-awaited operand we receive the deadly “L1 cache miss” signal.
What should be done in this case? Of course, we would have to halt the pipeline, and go look for the missing data. But the pipeline cannot be stopped just like that, it continues processing new micro-operations every clock. The scheduler has already released a succession of uop-s and we know that each next operation depends on the previous one. What will happen now?
The first micro-operation arrived at the execution unit. And since there was no data waiting for it, it was executed incorrectly. The next clock the second micro-operation receives incorrect data from the first micro-operation and is also executed incorrectly. And the same thing will happen to all uop-s until the very last one. Moreover, there appears one more serious problem.
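The way one missing operand spoils the whole dependent chain can be modelled with a simple validity flag. This is only a toy model under our own assumptions: each micro-operation has at most one producer, and LOAD and propagate_validity are hypothetical names.

```python
LOAD = "load"   # marker: the operand comes straight from the data load

def propagate_validity(producers, load_hit):
    """Tell which uops executed with correct operands.

    producers[i] is LOAD, None (no dependency) or the index of an
    earlier uop whose result uop i consumes.
    """
    valid = []
    for src in producers:
        if src is None:
            ok = True              # independent uop: always correct
        elif src == LOAD:
            ok = load_hit          # correct only if the load found its data
        else:
            ok = valid[src]        # inherits the producer's correctness
        valid.append(ok)
    return valid
```

For the chain of four from the text, propagate_validity([LOAD, 0, 1, 2], load_hit=False) marks all four micro-operations as incorrectly executed, while an independent micro-operation mixed into the stream would stay valid.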
Suppose the data was found in the L2 cache. The Northwood core will need 7 clock cycles to load the data from there. But the pipeline works only in one direction and cannot “reverse” the instructions flow. Our chain of micro-operations has already passed the execution units and was processed with the wrong operand. And if we do not undertake any urgent measures it will continue its way down the pipeline bearing more incorrect results and will be retired.
In other words, we not only executed some commands incorrectly! We also lost these four micro-operations, and even when the requested data is found we will not be able to re-execute this part of code anew. You saw that these micro-operations were taken from the Trace cache, put into the queue, processed by the scheduler and then sent to the execution device. And it is not their fault that there was no data in the L1 data cache. In other words, the CPU simply “lost” this part of the code.
Of course, this scenario is absolutely unacceptable for us.
That is why the Pentium 4 processor has a special sub-system intended to prevent the loss of micro-operations like in the situation described above. This sub-system will “hunt” micro-operations down before they retire and resend for execution.
It means that we need a kind of reverse mechanism. The simple idea behind it is that we need a way to redirect micro-operations for re-execution in case the “L1 miss” signal arrives, i.e. we need a “side exit” from the pipeline.
As soon as the “lost” data is received, the micro-operations should once again go through the execution units. And only after that, once the correct results are obtained, the notorious micro-operations can retire.
So, the replay system should work as an “initial task keeper”: no matter what happens inside the CPU, we should anyway execute the code correctly.
Another important question is how many micro-operations can be sent to replay and how long can they stay there. Since replay is a kind of “emergency exit”, it cannot be of large capacity. After a while the command should be sent for re-execution. Of course, the micro-operations can only be sent for re-execution when the requested data arrives, but then these commands should be stored somewhere in the meanwhile and then selected from the storage location. Besides, all these data availability checks, the commands transfer to replay and release for re-execution should all be performed at very high speeds (as you remember, the schedulers and execution units belong to Rapid Execution Engine working at twice the processor frequency).
As a result, we have to compromise: on the one hand, we need to lose as little time as possible while waiting for the requested data, and on the other hand, we need to work at twice the core frequency, which is why a complex algorithm will not do.
To minimize the idling time, the “pause” should be as long as it takes to deliver data from the L2 cache. This is the next fastest memory hierarchy after L1 data cache. Besides the L2 cache is usually much bigger than the L1 data cache, which makes the probability of the requested data being in the L2 cache very high.
In fact, it is evident why the “pause” should equal the L2 cache latency. If we make the pause shorter, the data will not have enough time to reach the execution units, and the problem will not be solved. If the pause is longer, the data will arrive at the execution unit before the micro-operation gets there, which will result in the same consequences: we will have to run the uop again.
So, Northwood core will have the minimum pause time for the data delivery equal to 7 clock cycles, the data will simply not make it earlier than that. Therefore, the micro-operation round-trip time in the replay system should be calculated so that the micro-operation could pass the scheduler (slowing down its operation at this point) and then could return to the execution unit 7 clocks later. And remember that the algorithm should be fairly simple, so that this replay system could work at high frequencies.
The replay system of the Pentium 4 processor is exactly a compromise like that: an attempt to re-execute micro-operations without increasing the complexity of their processing.
So, what does this replay look like? In fact, the replay system is none other but a part of a fictitious pipeline located parallel to the main one.
When the micro-operation leaves the scheduler, it falls into the unit called Replay multiplexer aka Replay mux (for details see our article called Replay: Unknown Peculiarities of the NetBurst Core). Then this micro-operation is cloned, i.e. its exact copy is created. The original micro-operation continues its way down the pipeline towards execution units, while the cloned micro-operation has a much more exciting destiny. There is a replay pipeline parallel to the main pipeline. The length of this fictitious pipeline equals the distance between the scheduler and the execution unit (later you will understand why).
The original micro-operation leaves the Replay multiplexer and starts off towards the execution unit, and at the same time the clone of this micro-operation starts the same movement along the fictitious pipeline without any actual processing. Both micro-operations, the original and the clone, are moving parallel to one another through all the stages on the way to the execution unit.
I know this may sound confusing, but I would still like to say that in reality, the micro-operation doesn’t move anywhere. The thing is that the replay system we are describing to you may not look exactly like this in silicon. It is only important that the described system and the real one in silicon behave the same way. Nevertheless, it is very convenient to explain the way this replay system works with the help of a fictitious pipeline model. To tell the truth, the exact configuration of this system is not that important for our story. It is the behavior of this system that matters most.
During micro-operation execution, a special Checker of the Pentium 4 processor checks if the data obtained as a result of the given micro-operation is “legitimate” or not.
If the answer is positive, then the micro-operation goes farther down the main pipeline towards retirement unit, and its clone on the fictitious pipeline is simply deleted. If everything was done correctly, replay system doesn’t have to interfere.
If the check indicates a “cache-miss” or some other event, such as data loading from the memory for instance (for more details about these events see our article called Replay: Unknown Peculiarities of the NetBurst Core), then the replay mechanism is activated. Here I would like to stop for a second and specify that there can be two reasons for the overall execution failure: failed execution of the current micro-operation, or failed execution of the uop-s our current micro-operation depends on.
In this case the original micro-operation with incorrect operands is deleted, and its clone is sent to replay, making a circle over the pipeline. It re-enters the actual pipeline right after the scheduler and gets into the multiplexer (whose main function is to automatically hold back the release of the next micro-operation from the scheduler once a micro-operation from replay comes in). So, this micro-operation starts moving towards the execution units again.
We have already mentioned above that the micro-operation arrives at the execution unit on time if the number of replay pipeline stages equals the L2 cache latency in clocks. That is 7 clocks for Northwood core and 18 clocks for Prescott core.
In fact it means that the circle a micro-operation makes should be passed in 7 (18 for Prescott) clock cycles.
When the resent micro-operation arrives into multiplexer, all other micro-operations processed by the scheduler will be slowed down, because the micro-operations coming from the replay system have higher priority over all other uop-s. To be more exact, the multiplexer will make the scheduler pause the release of micro-operations to the pipeline. You realize that higher priority of operations coming from the replay system is necessary to avoid replay overloading.
What happens if the data is not in the L2 cache? Or if there are too many incoming requests and the L2 cache will not be able to deliver the data in only 7 clocks?
Then our micro-operation will have to make another loop. It gets to replay once again and will be resent for execution for the second time. If the data again doesn’t make it within the given period of time, then the third loop will follow. If the data is coming from the memory, where the access latency can be hundreds of processor clock cycles, a command may keep circling tens and even hundreds of times, wasting processor resources this way.
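The looping described above can be put into numbers with a small sketch. It is a deliberately simplified model, not a description of the real hardware: we assume that a micro-operation gets a new chance exactly once per replay loop (7 clocks on Northwood, 18 on Prescott, as stated above) and that nothing else delays it. The function name `replay_passes` and the parameter `wait_clocks` (the time between the first failed execution and the arrival of the data) are our own illustrative names.

```python
import math

# Replay loop length in clocks for each core, as given in the text above.
REPLAY_LOOP_CLOCKS = {"Northwood": 7, "Prescott": 18}


def replay_passes(wait_clocks: int, core: str = "Northwood") -> int:
    """Estimate how many trips around the replay loop a uop makes.

    wait_clocks is a hypothetical input: the number of clocks between the
    first (failed) execution attempt and the moment the data is ready.
    Simplifying assumption: the uop gets one new attempt per loop and is
    never delayed by anything else.
    """
    if wait_clocks < 0:
        raise ValueError("wait_clocks must be non-negative")
    loop = REPLAY_LOOP_CLOCKS[core]
    # Each trip takes `loop` clocks; the uop keeps circling until the
    # accumulated time covers the wait.
    return math.ceil(wait_clocks / loop)


if __name__ == "__main__":
    # Data arrives from L2 exactly one loop later: a single trip is enough.
    print(replay_passes(7, "Northwood"))
    # An illustrative memory access that takes 300 clocks.
    print(replay_passes(300, "Northwood"))
    print(replay_passes(300, "Prescott"))
```

Under these assumptions a 300-clock wait (an illustrative figure, not a measurement) costs 43 trips on Northwood and 17 on Prescott, which agrees with the “tens” of loops mentioned above.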
Now we can state with all certainty that replay system is the one responsible for the weird L2 cache latency values, which pushed us to find out the roots of this phenomenon in the Pentium 4 processor.
In the next chapter we are going to pay special attention to some key replay features and resulting consequences. We will try to keep it all simple and if you are looking for more details on the Replay mechanisms, I suggest that you check out our article called Replay: Unknown Peculiarities of the NetBurst Core.
The remarkable thing is that this solution for keeping long pipelines running looks pretty logical at first glance, but then turns out to cause dramatic performance drops. We will talk more about the reasons for that in the next chapter, and for now it is important to understand one thing: replay is the price we have to pay for the long and deep pipeline. According to Intel’s ideology, high working frequency is the No. 1 priority, which is why the architecture developers went for the longer pipeline. And the long pipeline required this special “reversing system” for those cases when the data hasn’t been delivered to the micro-operations on time.
Note that the penalties imposed by replay do not depend on the quality of the program code or on the number of branches in it. Replay is the reverse side of the coin called “Hyper Pipeline”, i.e. the price you pay for the optimistic strategy of the scheduler. And this strategy is the only possible way the pipeline can work if the scheduler has been moved away from the execution units at a distance that exceeds the time it takes to execute most simple commands. Since the data cannot be delivered to the executed micro-operations immediately, the pipeline idles. And the worst thing about it is not so much the forced re-execution of the uop (we have to do it, there is no other way) as the necessity to re-execute the entire chain of dependent micro-operations, no matter how long it is. In other words, the re-execution problem spreads onto the entire micro-operation dependency chain.
So, summing up the discussion of the replay system, we can conclude that replay is bad but inevitable.
Now let’s reveal a bit more details of the discovered phenomenon.
Here I have to make a brave supposition that you are not too tired yet of all these details. Therefore, we are going to dig a little bit deeper into the replay mechanism, especially, since the details we discovered are very interesting and important for better understanding of Pentium 4 processor working principles.
Just to make sure, let’s review a few things we have already talked about in the previous chapter. We were saying that when a micro-operation gets into the replay system, it is actually executed at least twice: the first time incorrectly, and the second time with the correct operands. This way, the execution units did no useful work when the uop passed them the first time. Moreover, if the micro-operation gets into replay, it can also drag a few more micro-operations with it. In particular, we managed to create chains including thousands of commands, which were circling around the replay pipeline hundreds of times as a result of a single cache-miss! To be fair, I have to stress that hundreds of replay loops is not a very frequent situation. In most cases it will all be over after a few loops, or a few dozen. Of course, this circulation of the same micro-operations reduces the efficiency of the execution units, because about half of all commands going through them do no useful work.
Let’s take a closer look at a situation when a micro-operation chain makes multiple rotations within the replay loop. Why does this problem arise?
The reason for that is the enthusiasm of the scheduler, strange as it might sound. To be more exact, there are two reasons: the scheduler’s vital desire to load the execution units with the maximum efficiency, and its unawareness of the current micro-operation status in the execution unit.
Let’s return to our example with a chain of dependent commands. Let the first one get into the replay system. Then the second one, the third one, etc., will also go into replay. There is one important thing here: since the original micro-operation is moving in the main pipeline in parallel with its clone moving in the fictitious pipeline, the distance between micro-operations in the replay pipeline will remain the same. In fact, this ability to maintain the same distance between the micro-operations is one of the fundamental replay features: if the distances were changing, the processor logic would have a much harder time managing the work of these two pipelines.
So, if there was a gap between two dependent commands, when the scheduler didn’t release any micro-operations, there will be the same gap between the clones on the replay pipeline. To borrow a term from semiconductor physics, where hole conduction is a familiar notion, it will be “a hole”.
This hole turns out to have very interesting features. In particular, it is these holes together with the enthusiasm of the scheduler that result in a phenomenon called replay loop for the entire chain of uop-s.
Say we have a hole between micro-operations in the replay. When they return to the multiplexer after the replay (complete the first replay loop), there appears an opportunity to release another micro-operation for execution in place of the hole. And the scheduler cannot miss this opportunity for sure: its major goal is to load the execution units as efficiently as possible. And it is actually not so much the scheduler’s enthusiasm as its complete unawareness of the execution status of earlier released micro-operations: it simply follows the optimistic strategy described above blindly.
But it is all not that simple. It so happens that the next command is another dependent micro-operation from our dependency chain. But it needs the results of those micro-operations that were in front of it in the initial program code. And now it has turned out ahead of these commands on the way to the execution units, because these commands were circling around the replay pipeline! Of course, it arrives at the execution unit before the data it needs is ready, gets executed incorrectly, and is sent to replay.
So, on the next replay loop, when the beginning of our uop-s chain has been executed with correct operands, we will be witnessing a mirror image of what we have just described: a lot of empty positions in the replay loop and one command sent there too early. And the “enthusiastic” scheduler will immediately fill all the empty positions of the replay loop with new micro-operations, including those positions that were before our micro-operation and were intended to follow it, and not to precede!
I believe most readers have already got the point here… Absolutely correct: the entire micro-operations chain will go to replay. Our command will finally be executed, and all other commands will create a familiar situation going through the replay: a replay loop with micro-operations and a hole between them. The circle is closed: the scheduler continues sending new commands from the dependency chain to the execution units fitting them into every available hole between the micro-operations returned to the replay system, however, it has no idea what commands have already been executed successfully.
So, what do we get in the long run? It turns out that the entire micro-operations dependency chain should go through replay. For a more illustrative example, imagine that this dependency chain is a metal chain of small units linked together. Wrap it around a pen, so that there is a loop, and then pull one end of it. What do you see? The entire chain will slide around the pen.
It is absolutely the same with our chain of micro-operations. It will have to make at least one replay round, and maybe even more depending on how long it takes to get the data ready.
And this process can be over only when the chain of dependent micro-operations is over. So, it looks like a single “hole” can result in an “endless” replay that lasts until the entire chain of micro-operations is completed. Or until some other “magical” system steps in.
In our example it means that the loop around the pen will end only when the chain ends. Or when the chain breaks, that is, when the scheduler suddenly receives a command that doesn’t belong to our dependency chain and fits it into the available “hole”. This way the replay will end gradually.
But if there is more than one chain loop around the pen, then the replay exit mechanism we have just suggested may not work. There are even such chains of commands that will never allow breaking the replay cycle.
This situation is the worst consequence of the replay. It is one thing when a single command gets executed a few times instead of just once. It is frustrating, but we can live with that. It is a completely different story when a pretty significant part of the code is executed at least twice. In fact, the processor efficiency drops here, and the more replay loops are traveled, the lower the processor efficiency gets. So, the efficiency will drop at least by half!
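A quick back-of-the-envelope sketch shows how fast the efficiency falls. We assume here (our simplification) that efficiency is simply the share of useful executions among all executions of a micro-operation; `execution_efficiency` is an illustrative name of our own.

```python
def execution_efficiency(replay_trips: int) -> float:
    """Share of useful executions for a uop that went through replay.

    Simplifying assumption: every trip around the replay loop costs one
    extra pass through the execution unit, and only the last pass is useful.
    """
    if replay_trips < 0:
        raise ValueError("replay_trips must be non-negative")
    total_executions = replay_trips + 1
    return 1.0 / total_executions


if __name__ == "__main__":
    # 0 trips means the uop was executed correctly the first time.
    for trips in (0, 1, 3, 9):
        print(trips, execution_efficiency(trips))
```

With a single trip through replay the uop is executed twice, so only half of the work is useful, which is the “at least by half” case from the text.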
Should we be concerned with the low processor efficiency in this case? Will it affect the performance (the processing time of the code in the replay)? Let’s check out our example with a “hole” once again. Note that the scheduler had to wait for two loops of 7 clock cycles each (14 clock cycles in Northwood) before it could send a new command for execution. It means that in our case not only did the efficiency drop: the performance was halved as well!
This explains very clearly why Pentium 4 processor in certain cases yields in performance to its predecessor (!), Pentium III, despite its evident theoretical advantages, such as higher working clock frequency, faster bus, larger and faster cache and higher IPC (instructions per cycle). Note that replay is very often more than just “a stop and pipeline clearance for the code with too many branches”.
The most interesting thing is that if the scheduler could only halt the execution for a few clock cycles (exactly for as long as the chain of commands needs to get back from the replay system), we would have no problems at all. But the good intention to use the resources with maximum efficiency and to maintain high processing speeds, as well as unawareness of the situation down the pipeline, lead to absolutely opposite consequences. The execution resources get simply wasted. As we have already mentioned above, the operations that fall into replay are executed at least twice. The maximum number of executions per single operation can reach tens and even hundreds of times (in exceptional cases). This will inevitably cause a significant performance drop of our CPU on this part of the code, although the performance drop will certainly be not as dramatic as the efficiency drop.
It means that “thanks to” replay, the performance of our processor dropped by at least a factor of two, and in the worst case by tens or even hundreds of times, during the execution of a given part of the code! Well, it looks like the good old saying “easy does it” is absolutely true here.
When we studied the way micro-operations move along the replay pipeline, we discovered that sometimes certain commands may stay there longer than they actually should. For example, this situation occurred in case of aliasing errors or in case of a D-TLB miss. We got the impression that the scheduler has more than one replay loop.
And this appeared to be true. Some types of errors in Northwood core (such as aliasing, for instance; for more details see our article called Replay: Unknown Peculiarities of the NetBurst Core) cause the operation to be transferred to a different replay loop, which is 12 clock cycles long, unlike 7 in the first case.
It happens because in some cases we need to perform a special non-standard error check. The result of this check becomes available a bit later than usual. However, we have a limited amount of time for turning the operation into the replay loop if there is any problem. If the result of this check doesn’t arrive on time, the micro-operation will be executed incorrectly, continue its way down the pipeline and retire, thus causing a catastrophe: the program code has been executed incorrectly.
Since the check we are talking about is situated at a farther distance from the scheduler, the “rejected” micro-operation will have to travel along a bigger loop. This loop is 12 clock cycles long, compared with 7 clock cycles for the first smaller loop.
We called the above described replay loops according to their length in clocks: RL-7 and RL-12 respectively.
Here is the result: Northwood core has two replay loops. Prescott core has only one replay loop, RL-18 (for details see our article called Replay: Unknown Peculiarities of the NetBurst Core). This is connected with the fact that since the L2 access latency grew bigger, we now have more time to perform the check, so the CPU manages to complete it within a single pass.
Here I would like to draw your attention to the fact that we have been considering only one scheduler all this time. From Chapter VI we remember that there are FIVE schedulers like that in the Pentium 4 core. They are all independent of one another, each of them has its own queue, and it means…
It means that each scheduler has its own replay system. In other words, there are 10 fictitious pipelines hidden from the user in Northwood core!
Wow, the size of that part of Pentium 4 processor we haven’t heard anything about before is impressive. We can’t help asking ourselves: can THIS be called a beautiful architectural solution?
In Prescott core things got a bit different. There is only one replay loop for each scheduler, but these loops are longer: 18 clocks each. So, we have a total of five fictitious pipelines. Keeping in mind that at least a part of them is working at twice the core frequency (together with fast ALU pipelines), and that differential LVS logic is used, we are no longer surprised at the amount of heat dissipated by the Prescott core: this is all quite natural. Every operation that gets into replay occupies the pipeline at least twice. So, since there is more work to be done for the same piece of the program code, more heat is generated.
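The pipeline counts above are simple multiplication, which the following sketch spells out. The table `REPLAY_LOOPS` just restates the loop lengths named in the text; the function name `fictitious_pipelines` is ours.

```python
SCHEDULERS = 5  # number of schedulers in the Pentium 4 core, per the text

# Replay loops available to each scheduler, by length in clocks.
REPLAY_LOOPS = {
    "Northwood": [7, 12],  # RL-7 and RL-12
    "Prescott": [18],      # RL-18
}


def fictitious_pipelines(core: str) -> int:
    """Number of replay (fictitious) pipelines: one per loop per scheduler."""
    return SCHEDULERS * len(REPLAY_LOOPS[core])


if __name__ == "__main__":
    for core in REPLAY_LOOPS:
        print(core, fictitious_pipelines(core))
```

This gives 10 fictitious pipelines for Northwood and 5 for Prescott, the same totals as in the text.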
We can also make a brave supposition about the way the additional transistors of the core were spent. As you remember, there is no definite information about the number of transistors in Prescott core: the officially claimed number of transistors is too high for the banal doubling of the L2 cache size. So, we dare assume that there is “something else” there. This “something” can be a combination of five replay loops and ALU with EM64T support.
But let’s return to the replay. It turns out that the replay system can theoretically lead to complete blocking of the CPU. In particular, in our example you can see that if there is a hole, the commands in the replay loop change their initial order. It may turn out that the command dependency chain circling around the replay loop can only get out of there when a command arrives that is currently still in the scheduler. If the replay loop is completely full and there is no “hole” for this long-awaited command to fit in, the cycle will never break, because the replay commands have higher execution priority. This blocking is called “livelock”, and it cannot be resolved by regular means.
Nevertheless, the CPU doesn’t get into livelock in reality, as practical experience suggests. It implies that there is some emergency system which resolves the problem somehow when necessary. Getting a little bit ahead of our story, I would like to say that this system most likely breaks the “endless” replay after a few dozen loops (for details about this system see our article called Replay: Unknown Peculiarities of the NetBurst Core).
So, we understand that besides replay, Pentium 4 processor also has at least one unknown (!) emergency system. It serves to resolve livelock situations.
In fact, there are two systems: one of them discovers the problem and another one resolves it. Here we should give due credit to brave architectural engineers that put all this into life.
However, we keep studying these systems, and now we suggest turning to such interesting matter as replay and FPU.
We haven’t even mentioned the floating-point operations when we were talking about the replay all this time. And there is a good explanation to that. The thing is that the replay system communicates in a different way with FPU commands.
The loading of FPU, MMX and SSE2 registers from L1 cache takes much longer than the loading of integers (9/12 clock cycles against 2/4 clock cycles on Northwood/Prescott respectively). These additional 7/8 clock cycles are just enough to arrange the feedback between the scheduler and execution units. While we are waiting for the fp_load command to be executed, we have just enough time to let the scheduler know if there is an L1 cache miss. The scheduler will take this sad news into account and will not release the dependent FPU/MMX/SSE2 instructions for execution. In other words, before these operations are sent for execution, they manage to check if the operands are already available. This automatically eliminates the main reason for replay to occur. And in fact, this is very handy, because NetBurst processor architecture doesn’t contain any additional FPU units. There is only one FPU unit processing one instruction per clock cycle (while ALU processes six instructions per clock cycle). So, if the operations in the replay waste the resources of this unit, the overall processor performance will inevitably drop.
As a result, FPU commands never get into the RL-7 replay loop. Nevertheless, they can still get into the RL-12 replay loop, for example, if the FPU micro-operation depends on the results of an integer micro-operation that has already got into the RL-7 loop.
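The arithmetic behind the “additional 7/8 clock cycles” is easy to check. The sketch below only restates the L1 load latencies quoted above; `fp_load_slack` is our illustrative name for the extra time the scheduler gets to learn about a cache miss.

```python
# L1 load latencies in clocks, as quoted in the text.
L1_LOAD_LATENCY = {
    "Northwood": {"integer": 2, "fp": 9},
    "Prescott": {"integer": 4, "fp": 12},
}


def fp_load_slack(core: str) -> int:
    """Extra clocks an FPU/MMX/SSE2 load takes compared with an integer load."""
    latency = L1_LOAD_LATENCY[core]
    return latency["fp"] - latency["integer"]


if __name__ == "__main__":
    for core in L1_LOAD_LATENCY:
        print(core, fp_load_slack(core))
```

The difference comes out to 7 clocks for Northwood and 8 clocks for Prescott.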
In conclusion I would like to point out two more interesting facts connected with the FPU operations:
Now we have to discuss one more application for the replay system.
In the examples above we discussed the situation when replay is used to resolve the issues caused by a cache miss. In fact, this is far from the only function the replay system performs in the Pentium 4 processor. It is used to solve very diverse problems. I would even say: wherever possible.
In particular, replay is used to help with a very frequent situation (sincerely disliked by software developers): loading data right after storing it.
This is what the problem is actually about. When we have received some data and executed the Store command for it, so that the data is stored in the memory, some time has to pass before this particular data can be read from the memory again.
We will discuss this whole situation in greater detail in our article called Replay: Unknown Peculiarities of the NetBurst Core. Here I would only like to stress that if not enough time has passed after the store operation is complete, replay may step in as a way out.
The worst thing is that there is nothing software developers can do to prevent replay: the CPU reorders instructions internally very aggressively, which is why any data loading command may change its position in relation to other instructions in the program code, even if it was initially placed very far away (and as we know, new independent command threads usually start with data loading).
As a result, the processor tries to load data too early, and this operation gets sent to replay. And the entire dependency chain follows after it.
What does this actually mean? Operations like that always accompany a function call: the caller stores the parameters on the stack and the called function reads them from the stack. Function calls are present in all programs, with no exceptions. So, here is the conclusion: all programs, with no exceptions, have situations favorable for replay.
And in conclusion I would like to say a few words about the interaction between replay and Hyper Threading.
As you remember, Hyper Threading technology is intended to increase the efficiency of highly loaded processor units. Since replay eats up some of the execution unit resources, we were wondering if there is any influence there. And if there is any mutual influence, then how big it is? The answer is traditionally given in our article called Replay: Unknown Peculiarities of the NetBurst Core. So, you might want to check it out :)
From general considerations it is clear that the more workload falls upon the processor execution units, the less efficient Hyper Threading technology becomes. At the same time, replay causes a number of operations to be executed multiple times, which eats up processor resources. So, the two subsystems tending to use the same resources will come into conflict.
The result of our investigation matched our expectations.
Replay system can reduce the efficiency of Hyper Threading technology significantly. In particular, in certain situations replay can cause the overall performance loss of up to 45% in Northwood core and up to 20% in Prescott core. Moreover, the increase in the efficiency of Hyper Threading technology that we observe on Prescott processors is most likely connected with the replay improvement rather than with the enhancement of the Hyper Threading technology.
Well, I think we should take a break now. Especially, since we have already collected a lot of interesting material and now it’s high time we summed it all up and drew some conclusions. Besides, I believe that if you read that far, you definitely need a break, too. :)
We would like to say that figuring out all the tiniest details of the replay mechanism was neither easy nor fast. In fact, we postulated the existence of the replay system back in February 2004, and since then we have been studying its working principles and the influence it has on all other components of the processor architecture.
Besides, when we were working on the article, we faced an evident contradiction. On the one hand, replay and its features are really interesting and haven’t been yet described in that much detail anywhere. So, we feel like providing as much in-depth info as possible about the Pentium 4 processor operation.
On the other hand, this is a very specific processor peculiarity, and we have no idea how interesting this is going to be for you, our readers. We are really uncertain that most of you ever go that far into the details...
So, in order to make this subject interesting to the most of X-bit readership groups, we did our best to deliver the message in a simple and easy to understand manner, at the same time retaining the technical correctness and level of detail, in order to please the techy part of our visitors, too.
It is actually up to you to decide if we managed to accomplish our goals or not. If it was an interesting read, if you remembered and learned something, if you realized at least for a second how complex the processor architecture actually is, then our efforts were not in vain :)
So, let’s make some conclusions about the replay and our investigation of the Pentium 4 micro-architecture.
Replay is an inalienable part of NetBurst ideology. This part of it has been unknown to the general public for a while. But this mechanism ensures proper functioning of the Pentium 4 micro-architecture, which is why it is worth paying special attention to.
Replay is negative for the processor performance. However, this is the price we had to pay for the longer pipeline and considerably higher working frequency. It is quite possible that the complex replay mechanism, its negative influence on the processor performance, and the additional overheating it causes forced Intel to cancel the Tejas core, which was supposed to replace Prescott. At least, this hypothesis explains what we see well enough (of course, only Intel management knows the true motives behind this decision).
We hope that our description of the replay system managed to fill the informational vacuum around one of the most interesting and mysterious subsystems of the Pentium 4 CPU.
We believe that this information must be revealed at least in the micro-architecture descriptions and optimization guides: those who tend to optimize their software for maximum performance should know about the “hidden dangers” of this process.
On the other hand, we do understand why Intel didn’t do it: they would hardly manage to describe replay without making a negative impression on the potential customers. And a negative impression is definitely not what any commercial corporation is hunting for. Unfortunately, their strategy in this respect, namely concealment of the replay existence, is also not the most reasonable thing to do. There is a very thin border between marketing and deceit, and it looks like in this case marketing has crossed it.
Unfortunately, replay affects the Pentium 4 processor performance in a negative way. The only thing that justifies its existence is the fact that Pentium 4 processor will not work correctly at all without the replay.
Anyway, we are not going to stop here and we intend to continue investigating the Pentium 4 mysteries. It is simply time to take a break and to look back: what have we achieved during the past year of hard work? Which way shall we take in our further investigations? What tasks shall we set for our study of Pentium 4 architecture?
Some tasks have already been set, actually. For example, the launch of Pentium 4 6xx series made it very important to find out how effectively the 64-bit instructions support has been implemented there. And we are already working on it. Hopefully we will be able to share some results with you soon enough.
There are still questions left about the replay, so we will continue investigating a few pretty interesting sides of this mechanism.
Anyway, there are still a lot of things we could dig into. It is even sad in a way that the second article of the trilogy has come to an end :)
Stay tuned for the next part of our detailed investigation, which will be called Replay: Unknown Peculiarities of the NetBurst Core!