Instructions can execute incorrectly for multiple reasons. Besides dependence on the results of previous instructions, the external conditions include an L1 cache miss, incorrect store-to-load forwarding, hidden data dependencies, and so on.
Let’s find out what the replay system looks like (Pic.3a, Pic.3b). The Scheduler output is connected to the Replay Mux, which sends operations down two pipelines. The first, main pipeline delivers the commands to the execution units. The second pipeline belongs to the replay system itself and consists of empty stages that do no useful work; their number up to the Check stage equals the number of stages in the main pipeline. This second pipeline receives exact copies of the operations traveling in parallel through the first one.
The operations move along both pipelines in parallel until they reach the Check stage, where the Checker unit verifies whether the operation in the main pipeline has executed correctly. If everything is alright, the operations retire (Pic.3a). If an incorrect result was obtained for some reason (for example, an L1 Miss signal arrived), the chain of operations from the second pipeline is sent back to the Replay Mux through a replay loop (Pic.3b). At the same time (in the case of an L1 cache miss, for instance) a request is sent to the next caching level (the L2 cache). The replay loop may contain additional “fictitious” stages (marked STG. E and STG. F in the picture). The number of these stages is chosen so that the delay of the operation over the complete loop is just enough for the data to arrive from the new cache level (for example, the L2 cache latency, i.e. 7 clock cycles).
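The timing idea above can be sketched in a few lines of Python. This is a toy model, not the actual hardware: the stage counts (`STAGES_TO_CHECK`, `EXTRA_LOOP_STAGES`) are assumed values chosen only so that one full trip around the loop takes 7 cycles, matching the L2 latency from the text.

```python
# Toy model of the replay loop. STAGES_TO_CHECK and EXTRA_LOOP_STAGES are
# assumptions for illustration; only their sum (the loop length) matters.
STAGES_TO_CHECK = 4    # stages from the Replay Mux to the Check stage (assumed)
EXTRA_LOOP_STAGES = 3  # "fictitious" stages (STG. E, STG. F, ...) on the way back (assumed)
REPLAY_LOOP = STAGES_TO_CHECK + EXTRA_LOOP_STAGES  # full loop: 7 cycles ~ L2 latency

def cycles_to_complete(data_ready_cycle: int) -> int:
    """Cycles until the operation passes the Check stage with valid data.

    The operation re-enters the Replay Mux every REPLAY_LOOP cycles and
    succeeds the first time it reaches Check after the data has arrived.
    """
    t = STAGES_TO_CHECK          # first arrival at the Check stage
    while t < data_ready_cycle:  # data not there yet -> one more loop
        t += REPLAY_LOOP
    return t

# L1 hit: data is ready immediately, no replay needed.
assert cycles_to_complete(0) == 4
# L1 miss, data arrives from L2 7 cycles later: exactly one replay loop.
assert cycles_to_complete(7) == 11
```

The point of tuning the loop length is visible here: with a 7-cycle loop, the operation returns to the Check stage on exactly the cycle the L2 data is expected.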
Just before the returned command is expected to arrive at the Replay Mux, the Checker unit sends a special stop signal to the scheduler, so that the scheduler reserves a free slot in the next clock. The Replay Mux inserts the command returned for repeated execution into this slot. All commands that depend on the incorrectly executed operation are also returned for re-execution. Note that the distance between these commands equals a fixed number of stages.
Here I have to stress right away that a command can be sent to replay multiple times. For instance, the data from the L2 cache may simply arrive too late when many repeated requests to the L2 cache occur. In this case one or two additional loops may become necessary, which will increase the effective L2 read latency. For example, the data may arrive from the L2 cache in 9 clock cycles instead of 7, and an additional loop will add at least 7 clock cycles to that. If the data is not in the L2 cache at all, the chain of commands depending on it will keep rotating in the replay system until the requested data arrives from main memory.
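A consequence of the fixed-length loop is that the observed latency gets rounded up to a whole number of loop trips. A minimal sketch of that arithmetic, assuming a 7-cycle loop as in the text:

```python
# Effective delay when data arrival is quantized by a 7-cycle replay loop
# (loop length is the value assumed in the article's example).
REPLAY_LOOP = 7

def effective_delay(actual_latency: int) -> int:
    """Round the actual data-arrival latency up to whole replay loops."""
    loops = -(-actual_latency // REPLAY_LOOP)  # ceiling division
    return loops * REPLAY_LOOP

assert effective_delay(7) == 7    # data on time: one loop suffices
assert effective_delay(9) == 14   # data 2 cycles late: a whole extra loop
```

So data that is only 2 cycles late costs a full extra 7-cycle loop, which is exactly why measured latencies cluster at loop multiples rather than at the raw cache latency.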
Additional replay loops are one of the reasons why the actual L2 cache latency may turn out to be much higher than the number claimed in the official documents.