We have already mentioned above that a micro-operation arrives at the execution unit on time if the number of replay pipeline stages equals the L2 cache latency in clock cycles: 7 clocks for the Northwood core and 18 clocks for the Prescott core.
In effect, this means that the loop a micro-operation makes through the replay system must be completed in 7 clock cycles (18 for Prescott).
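This timing relation can be sketched with a minimal model. The loop lengths and L2 latencies below come straight from the text; everything else (names, structure) is purely illustrative:

```python
# Hypothetical model of the replay-loop timing described above:
# a re-issued micro-operation meets its data on time only if the
# number of replay stages equals the L2 load latency in clocks.

REPLAY_LOOP_CLOCKS = {"Northwood": 7, "Prescott": 18}  # replay loop length
L2_LATENCY_CLOCKS = {"Northwood": 7, "Prescott": 18}   # L2 load-to-use latency

for core in ("Northwood", "Prescott"):
    uop_arrival = REPLAY_LOOP_CLOCKS[core]   # clock at which the uop re-enters execution
    data_arrival = L2_LATENCY_CLOCKS[core]   # clock at which the L2 data is ready
    assert uop_arrival == data_arrival, f"{core}: uop would miss its data again"
    print(f"{core}: {uop_arrival}-clock replay loop matches L2 latency")
```

The point of the equality is simply that the re-sent micro-operation and its data converge on the execution unit in the same clock cycle.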
When the re-sent micro-operation arrives at the multiplexer, all other micro-operations processed by the scheduler are delayed, because micro-operations coming from the replay system have priority over all others. To be more precise, the multiplexer makes the scheduler pause the release of micro-operations into the pipeline. This higher priority for operations coming from the replay system is necessary to keep the replay loop from overflowing.
What happens if the data is not in the L2 cache? Or if there are too many incoming requests and the L2 cache cannot deliver the data in just 7 clocks?
Then our micro-operation will have to make another loop. It enters replay once again and is re-sent for execution a second time. If the data again fails to arrive within the allotted time, a third loop follows. If the data is coming from main memory, where access latency can be hundreds of processor clock cycles, a command may keep circling tens or even hundreds of times, wasting processor resources all the while.
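A rough back-of-the-envelope estimate shows how quickly these loops add up. The 7- and 18-clock loop lengths come from the text; the 300-clock memory latency is an assumed example figure, not a measured value:

```python
import math

# Hypothetical estimate: how many trips through the replay loop a
# load-dependent uop makes before its data finally arrives.
def replay_passes(data_latency_clocks: int, loop_clocks: int) -> int:
    """Trips through the replay loop needed to cover the data latency."""
    return math.ceil(data_latency_clocks / loop_clocks)

# Example: an assumed 300-clock memory access.
print(replay_passes(300, 7))    # Northwood's 7-clock loop -> 43 passes
print(replay_passes(300, 18))   # Prescott's 18-clock loop -> 17 passes
```

Each of those passes occupies pipeline slots and execution ports without doing useful work, which is why a single memory-bound load can tie up resources for so long.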
Now we can state with certainty that the replay system is responsible for the odd L2 cache latency values that prompted us to investigate the roots of this phenomenon in the Pentium 4 processor.
In the next chapter we will pay special attention to some key replay features and their consequences. We will try to keep it simple; if you are looking for more details on the replay mechanism, we suggest you check out our article called Replay: Unknown Peculiarities of the NetBurst Core.
The remarkable thing is that this solution for keeping a long pipeline running looks quite logical at first glance, but turns out to cause dramatic performance drops. We will discuss the reasons in the next chapter; for now it is important to understand this: replay is the price we pay for the long, deep pipeline. According to Intel's philosophy, high clock frequency is the No.1 priority, which is why the architects went for the longer pipeline. And the long pipeline required this special "reversing system" for those cases when the data has not been delivered to a micro-operation on time.
Note that the penalties imposed by replay do not depend on the quality of the program code or on the number of branches in it. Replay is the other side of the coin called "Hyper Pipeline", i.e. the price you pay for the scheduler's optimistic strategy. And this strategy is the only way the pipeline can work when the scheduler has been moved so far from the execution units that the distance exceeds the execution time of most simple commands. Since the data cannot be delivered to the executing micro-operations immediately, the pipeline idles. And the worst part is not so much the forced re-execution of the uop itself (there is no other way), but the necessity to re-execute the entire chain of dependent micro-operations, no matter how long it is. In other words, the re-execution problem spreads to the whole micro-operation dependency chain.
So, summing up the discussion of the replay system, we can conclude that replay is bad but inevitable.
Now let's look at a few more details of the discovered phenomenon.