In our example this means that the loop around the pencil will end only when the chain ends, or when the chain breaks, that is, when the scheduler suddenly receives a command that doesn’t belong to our dependency chain and fits it into the available “hole”. In this case the replay peters out gradually.
But if the chain winds around the pencil more than once, then the replay exit mechanism we have just described may not work. There are even command chains that never allow the replay cycle to be broken at all.
This situation is the worst consequence of replay. It is one thing when a single command gets executed a few times instead of once: frustrating, but something we can live with. It is a completely different story when a fairly significant part of the code is executed at least twice. The processor’s efficiency really drops here, and the more replay loops are traveled, the lower it gets. Even in the best case, the efficiency is at least halved!
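The arithmetic behind this claim can be sketched as a toy model (a simplification introduced here purely for illustration, not anything taken from processor documentation): if every operation in a stretch of code ends up being executed k times because of replay, only 1/k of the consumed execution slots do useful work.

```python
def execution_efficiency(executions_per_op: int) -> float:
    """Fraction of consumed execution slots doing useful work when each
    operation is executed `executions_per_op` times due to replay.
    A toy model: efficiency is inversely proportional to the number
    of times the same operation travels through the execution units."""
    return 1.0 / executions_per_op

print(execution_efficiency(1))  # 1.0 -- no replay, every slot is useful
print(execution_efficiency(2))  # 0.5 -- each op executed twice: efficiency halved
```

With tens or hundreds of replay trips per operation, the same formula gives efficiencies of a few percent or less, which is what makes a long-lived replay loop so damaging.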
Should we be concerned about the low processor efficiency in this case? Will it affect performance (the processing time of the code caught in replay)? Let’s look at our example with a “hole” once again. Note that the scheduler had to wait for 7 clock cycles every two loops (14 clock cycles in Northwood) before it could send a new command for execution. It means that in our case not only did the efficiency drop: the performance also got twice as low!
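Using the numbers from the example, the slowdown is easy to work out. Note that the assumption that two loop iterations normally occupy about 7 cycles of useful work is ours, inferred from the text’s conclusion that performance halves; it is not stated explicitly:

```python
def slowdown(useful_cycles: int, stall_cycles: int) -> float:
    """Ratio of execution time with a replay stall to the ideal time:
    the stall cycles are pure overhead added on top of the useful work."""
    return (useful_cycles + stall_cycles) / useful_cycles

# Willamette case from the example: 7 idle cycles per two loop iterations,
# assuming (our assumption) those two iterations normally take ~7 useful cycles.
print(slowdown(7, 7))  # 2.0 -- performance halved, matching the text
```

On Northwood, where the replay loop is longer and the wait grows to 14 cycles per two iterations, the same formula applies with the larger stall value.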
This explains very clearly why the Pentium 4 processor in certain cases loses in performance to its predecessor (!), the Pentium III, despite its evident theoretical advantages, such as a higher clock frequency, a faster bus, a larger and faster cache, and higher IPC (instructions per cycle). Note that replay is very often more than just “a stop and pipeline clearance for the code with too many branches”.
The most interesting thing is that if the scheduler could simply halt execution for a few clock cycles (exactly as long as the chain of commands needs to return from the replay system), we would have no problems at all. But the good intention to use the resources with maximum efficiency and maintain high processing speed, combined with unawareness of the situation further down the pipeline, leads to exactly the opposite result: the execution resources are simply wasted. As we have already mentioned, the operations that fall into replay are executed at least twice, and the number of executions of a single operation can reach tens and even hundreds (in exceptional cases). This inevitably causes a significant performance drop on this part of the code, although the performance drop will certainly not be as dramatic as the efficiency drop.
It means that “thanks to” replay, the performance of our processor drops at least twofold, and in the worst case tens or hundreds of times, during the execution of a given part of the code! Well, it looks like the good old saying “easy does it” is absolutely true here.