In other words, the problem is not only that some commands were executed incorrectly. We also lost these four micro-operations, and even when the requested data is found, we will not be able to execute this part of the code again. As you saw, these micro-operations were taken from the Trace cache, put into the queue, processed by the scheduler and then sent to the execution units. It is not their fault that the data was missing from the L1 data cache. The CPU has simply “lost” this part of the code.
Of course, this scenario is absolutely unacceptable for us.
That is why the Pentium 4 processor has a special sub-system intended to prevent the loss of micro-operations in situations like the one described above. This sub-system “hunts down” micro-operations before they retire and resends them for execution.
This means that we need a kind of reverse mechanism. The idea is simple: we need a way to redirect micro-operations for re-execution when the “miss L1 signal” arrives, i.e. we need a “side exit” from the pipeline.
As soon as the “lost” data is received, the micro-operations should go through the execution units once again. Only after that, once the correct results are obtained, can the affected micro-operations retire.
So, the replay system should work as an “initial task keeper”: no matter what happens inside the CPU, the code must still be executed correctly.
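The “side exit” idea can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not a description of the actual Pentium 4 hardware: one micro-operation executes per cycle, a single queue stands in for both the scheduler and the replay path, and in-order retirement is not modelled. The function name `run_with_replay` and the `data_ready_at` table are hypothetical and introduced here purely for illustration.

```python
from collections import deque

def run_with_replay(uops, data_ready_at):
    """Toy model of replay: a uop whose data is missing is not dropped,
    it is sent back for another pass through execution.

    uops          -- uop names in program order (hypothetical)
    data_ready_at -- dict: uop name -> cycle at which its operand is available
                     (uops not listed are assumed ready from cycle 0)
    Returns a list of (uop, completion_cycle, executions).
    """
    pending = deque((name, 1) for name in uops)  # (uop, execution count)
    completed = []
    cycle = 0
    while pending:
        name, count = pending.popleft()
        if data_ready_at.get(name, 0) <= cycle:
            # Data is available: the result is correct, the uop may complete.
            completed.append((name, cycle, count))
        else:
            # "Miss" signal: take the side exit and try again later.
            pending.append((name, count + 1))
        cycle += 1
    return completed

result = run_with_replay(["load", "add", "sub"], {"load": 3})
print(result)
```

In this run the `load` is executed twice, and no micro-operation disappears: every name in the input list shows up exactly once in the result.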
Another important question is how many micro-operations can be sent to replay and how long they can stay there. Since replay is a kind of “emergency exit”, it cannot have a large capacity, so after a while a command has to be sent for re-execution. Ideally, the micro-operations would be sent for re-execution only when the requested data arrives, but then these commands would have to be stored somewhere in the meantime and later selected from that storage. Besides, all these data availability checks, the transfer of commands to replay and their release for re-execution have to be performed at very high speed (as you remember, the schedulers and execution units belong to the Rapid Execution Engine, which works at twice the processor frequency).
As a result, we have to compromise: on the one hand, we need to lose as little time as possible while waiting for the requested data; on the other hand, we need to work at twice the core frequency, which rules out any complex algorithm.
To minimize the idle time, the “pause” should be as long as it takes to deliver data from the L2 cache. This is the next fastest level of the memory hierarchy after the L1 data cache. Besides, the L2 cache is usually much bigger than the L1 data cache, which makes the probability of the requested data being in the L2 cache very high.
In fact, it is evident why the “pause” should equal the L2 cache latency. If we make the pause shorter, the data will not have enough time to reach the execution units, and the problem will not be solved. If the pause is longer, the data will arrive at the execution unit before the micro-operation gets there, which will lead to the same consequences: we will have to run the uop again.
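The trade-off behind the length of the pause can be put into numbers with a rough sketch. The model below makes several assumptions that are not taken from the article: the data arrives exactly `l2_latency` cycles after the miss, it stays available once it has arrived, and the micro-operation is re-executed every `pause` cycles until it finds the data. Under these assumptions a pause that is too long shows up only as lost cycles; the model does not reproduce the extra re-run described above. The latency of 7 cycles is an illustrative value, not a measured Pentium 4 figure, and `replay_outcome` is a hypothetical helper.

```python
def replay_outcome(l2_latency, pause):
    """Return (passes, finish_cycle, idle_cycles) for one uop that missed L1.

    passes       -- how many times the uop goes around the replay path
    finish_cycle -- cycle of the re-execution that finally finds the data
    idle_cycles  -- cycles during which the data was ready but unused
    """
    if l2_latency <= 0 or pause <= 0:
        raise ValueError("latency and pause must be positive")
    passes = -(-l2_latency // pause)  # integer ceiling of l2_latency / pause
    finish = passes * pause
    idle = finish - l2_latency
    return passes, finish, idle

# Compare a pause that is too short, exactly right, and too long.
for pause in (3, 7, 10):
    print(pause, replay_outcome(7, pause))
```

With these illustrative numbers, a pause of 3 cycles needs three passes and still finishes two cycles late, a pause of 7 cycles needs one pass with no idle time, and a pause of 10 cycles needs one pass but leaves the data waiting for three cycles.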