Infinite Replay Looping
In the previous example we discussed the situation when a command from an independent chain fits into the “hole” between the dependent commands. Unfortunately the scheduler of contemporary NetBurst processors is not “smart enough” to track down the entire chain of dependent commands sent to replay. In other words, the scheduler always considers the operation to be performed correctly. Now let’s take a closer look at the situation when a command from the same dependency chain fits into the “hole”. This matter is very important for the proper performance estimates, because the indirect dependency chains can be very frequently found in programs.
As the simplest example here I would like to offer you a Total Sum calculation, when the result of every next operation depends on the result of the previous one directly.
for (int I = 0; I< count; I++)
Well, let’s analyze how the replay works with the help of this example of a long data dependency chain (see Pic.5 below).
LD R1, [X] // loading X into register R1
ADD R1, R2 //1 – R1 = R1+R2
ADD R1, R2 //2 – R1 = R1+R2
ADD R1, R2 //3 – R1 = R1+R2
ADD R1, R2 //4 – R1 = R1+R2
ADD R1, R2 //5 – R1 = R1+R2
ADD R1, R2 //6 – R1 = R1+R2
ADD R1, R2 //7 – R1 = R1+R2
ADD R1, R2 //8 – R1 = R1+R2
ADD R1, R2 //9 – R1 = R1+R2
Just like in the previous example (Pic.4b) we will take the L1 load latency equal to 2 clock cycles, and the ADD latency – to 1 clock cycle. X is not in L1 cache (L1 miss), but is found in L2 cache.
In the model on Pic.5 the scheduler sends a chain of commands to the pipeline without any problems keeping in mind their estimated execution latency. Just like in the example on Pic.4b LD command reaches Hit/Miss stage on the first pipeline at 6th clock cycle, where L1 Miss occurs, and its copy on the second pipeline reaches the Check stage where it receives an L1 miss signal and turns to replay. All the following ADD commands cannot be executed correctly, so they also go to replay. At 8th clock cycle LD command reaches the replay mux when the signal arrives that it should stop and let one slot free for the LD command, so replay mux resends LD at the next 9th clock cycle. After that, beginning with the 10th clock cycle, the negative consequences of the replay start popping up. The scheduler cannot track down the entire dependency chain beginning with LD that is why it inserts ADD7 into the “hole” between LD and ADD1 at 10th clock cycle. Of course, ADD7 cannot be executed correctly before ADD6 is, but in reality ADD7 appears ahead of ADD6 in the pipeline. That is why when LD operation retires at 14th clock cycle after obtaining the correct data, while ADD7 turns to replay and so do ADD8, ADD9 and all other commands following them. There is a big “hole” between ADD7 and ADD8 commands, where other commands will fit, and so on and so forth. Moreover, these “holes” can be created by the commands from other chains, which the scheduler sent for execution during commands rearrangement.
So, we see the following picture: the scheduler doesn’t know what commands have been executed correctly, but it continues sending new commands from the dependency chain for execution inserting them into “holes” between commands sent to replay. Since not all the previous commands have been executed correctly, the commands in the “holes” will also be executed incorrectly and will have to get back to replay. All commands of this chain will keep rotating in the replay loop one by one like links of the chain circled around a rod. If we create a time diagram representing the Dispatch stage at each given moment of time, it will look as follows: