New commands will keep entering the replay loop until the dependency chain ends. The negative influence of replay on the rest of the chain results not only in increased operation latency, but also in an outright waste of computational resources, because every operation caught in the replay system has to be executed at least twice: once before and once after the replay. And as we have already pointed out, commands can be re-executed several times while waiting for their data to become ready, so the resource load may grow more than twofold.
In real applications, loops of dependent commands may disappear if some event causes a pipeline flush, or if the "holes" get "patched" — that is, if enough commands from independent chains fit into the "holes" between the operations sent back to replay (as in the example in Pic.4b). However, it is extremely hard to arrange such "patches" in program code, because the final order of command execution is decided by the scheduler, which will not necessarily "patch the hole" at the right moment.
Moreover, as you already know, the Pentium 4 processor features five schedulers. Each scheduler has its own replay system, so commands of different types circulate in different replay loops, increasing the number of "holes" in each of them, which in turn raises the probability that dependent commands will start looping.
As we found out, the "holes" between commands resent for repeated execution, which create long-lasting replay loops, are the main reason why a dependency chain of MOV EAX, [EAX] instructions fails during L1 cache latency measurements on the Pentium 4 processor. The existence of "holes" also explains the graph we showed at the very beginning of this article. It turns out that once the commands from the dependency chain fall into the "holes", in the worst case they start looping more times than actually necessary, increasing the execution latency tremendously. The number of "loops" they make depends on the combination of "holes", load commands, and the number of commands between the loads.
In our investigation we used undocumented event counters for operations sent to replay. We also developed special tests in which we arranged the commands between the loads so that the resulting "patches" would not let the commands from the dependency chain fall into the "holes" between the replayed commands. The results confirmed the theory: if the "holes" are "patched" in time with other commands, the chain is protected against looping, the program code executes much faster, and the measured cache latency matches the values claimed in the official documentation.
Thus, the usual way of measuring latency doesn't work for the NetBurst architecture (in the case of an L1 cache miss, for example). The measured number includes not only the latency itself, but also the additional delay caused by replay looping. The looping may last quite long, so the effective latency can reach hundreds of clocks instead of the nominal 9. Worst of all, replay not only delays the execution of a single instruction, but also ties up execution resources that could have been used for other, independent operations in the meantime.