Replay: Unknown Features of the NetBurst Core

In the third part of our NetBurst architecture investigation trilogy we reveal the details of the Replay mechanism implemented in Intel Pentium 4 processors, a mechanism Intel keeps quiet about. Its working principles go a long way toward explaining why Pentium 4 processors often perform rather slowly despite their high clock frequencies.

by Victor Kartunov, Yury Malich, Jan Keruchenko aka C@t, and Vadim Levchenko aka VLev
06/06/2005 | 04:20 PM

Since the day Intel announced its Pentium 4 processor, many questions have arisen about the strange results this processor demonstrates in a number of tasks. Although Pentium 4 processors boasted higher clock frequencies and specific architectural features such as the Trace Cache, Rapid Execution Engine, Quad-Pumped Bus, hardware prefetch and even Hyper-Threading, which were all supposed to increase the number of commands processed per clock, Pentium 4 processors turned out unable to outperform their counterparts (Pentium M) as well as their competitors (AMD Athlon) working at lower frequencies. Most reviewers usually attribute these performance issues to the longer pipeline, and sometimes to the small cache capacity or higher memory latency. Other reasons are suggested only rarely.

 

However, all the factors I have just mentioned fail to really explain certain anomalies you can come across during testing. As an example, let's consider measuring memory latency with a chain of dependent mov eax, [eax] commands (the so-called pointer chasing) "with aggravation", when the chain of dependent load commands is extended with a chain of ADD operations: X * { mov eax,[eax] - N*{add eax, 0} }.

Since we know how long an addition takes, we can determine the load time T as the time required to process a single iteration minus the time required for a chain of N additions. If everything were that simple, the T(N) graph would be a horizontal line located at the ideal L2 cache access time, i.e. 9 clocks (2 clocks of L1 access plus 7 additional clocks for L2). In reality the graph looks as follows, and its shape and behavior are simply impossible to explain with the information Intel's optimization guides offer us:


Pic. 1: Pentium 4 (Northwood) L2 cache latency testing
with a dependency chain X*{mov eax,[eax] - N*{add eax,0}}.
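For reference, here is a minimal sketch of how such a measurement can be organized. This is our own illustration rather than the exact test code: it assumes GCC with inline assembly on x86 and a pre-built pointer-chasing buffer, and it uses the time stamp counter for timing.

/* Hypothetical sketch of the T(N) measurement: time X iterations of
   { mov eax,[eax] ; N * add eax,0 } over a pointer-chasing buffer and
   subtract the known cost of the N additions. In the real test the N
   additions are unrolled; the inner loop is kept here only for brevity. */
#include <stdint.h>
#include <x86intrin.h>                                 /* __rdtsc()        */

#define X 1000000                                      /* loads per run    */

static double measure_T(void **chain_head, int n_adds)
{
    void *p = chain_head;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < X; i++) {
        __asm__ volatile ("mov (%0), %0" : "+r"(p));   /* dependent load   */
        for (int k = 0; k < n_adds; k++)
            __asm__ volatile ("add $0, %0" : "+r"(p)); /* dependent add 0  */
    }
    uint64_t t1 = __rdtsc();
    __asm__ volatile ("" : : "r"(p));                  /* keep p alive     */
    /* T(N) = time per iteration minus the cost of N single-clock ADDs */
    return (double)(t1 - t0) / X - (double)n_adds;
}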

Luckily, there is at least one hint in the optimization guides: a very brief and superficial description of a mechanism called replay. Here is a quote:

«Replay

In order to maximize performance for the common case, the Intel NetBurst micro-architecture sometimes aggressively schedules µops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, µops must be reissued. This mechanism is called replay.
Some occurrences of replays are caused by cache misses, dependence violations (for example, store forwarding problems), and unforeseen resource constraints. In normal operation, some number of replays are common and unavoidable. An excessive number of replays indicate that there is a performance problem.»

This terse explanation suggests that replay may cause serious problems when a cache miss occurs. In fact, it occurred to us after reading this description that replay could explain the shape of the L2 cache latency graph. Our search for additional information in official documents and articles was in vain: all the data we could dig up comes from patents.

So, the article you are about to read is the result of our detailed study of a number of Intel patents devoted to the replay mechanism.

We also carried out and analyzed a whole set of benchmarks. Here we paid most attention to the Northwood processor core. A detailed study of the Prescott core is still in progress, as it requires a lot of time and resources.

What Do We Need the Replay For?

The main feature of the NetBurst pipeline distinguishing it from the Intel P6 and AMD K7/K8 pipelines is the appearance of a few additional stages between the scheduler and the execution unit.


Pic. 2a: Flow-chart for a pipeline segment from P6 processor core.
Pic. 2b: Flow-chart for a pipeline segment from NetBurst processor core.

Let’s discuss the operation of the NetBurst processor pipeline on the interval between the scheduler and the execution unit. From here on we will be using a simplified schematic representation of the pipeline with a few intermediate stages omitted.

The main task of the processor scheduler is to send commands for execution in such a way that the execution units are always busy. The scheduler should dispatch commands so that, by the time a command arrives at the execution unit, all of its operands have already been calculated. In the NetBurst architecture, it takes a command more clock cycles to travel from the scheduler to the execution unit than the execution of any simple operation requires. Therefore, the next operation has to depart for the execution unit before the previous operation has been processed and its result is ready; if operations are not sent in advance, execution will not be efficient enough.

In order to correctly decide when the next operation should be sent out, the scheduler has to predict when its input data will be ready. The prediction is based on the time it takes to complete the previous operations whose results serve as operands for the current instruction. When the execution time of an operation is fixed (i.e. known in advance), the scheduling task is easy to solve. However, there are a lot of instructions whose execution time is not known in advance. Among these are loads from memory (LD), for instance: the time this operation takes depends on where the data resides in the memory subsystem or cache hierarchy, and may vary from 2 to a few hundred clock cycles. Theoretically, the easiest way to schedule commands with unknown execution time is to assume the worst-case latency. However, for a load from memory this number can reach hundreds of clock cycles (if we have to go to system RAM). As another possible solution, we could simply make the scheduler hold an ADD instruction that depends on the result of the LD command until the data has actually arrived. However, in a processor with a long pipeline this method is not efficient either, because the effective execution time of an L1 data load would then become the L1 cache latency plus the distance to the scheduler in clock cycles (in the example above, the distance from the Scheduler to ALU_Oper).

In order to keep execution efficient, the ADD command that depends on the result of the LD load should be sent out at the presumably best moment, and this assumption should be based on the best-case latency estimate. At the same time we need some backoff mechanism that kicks in when an L1 miss occurs; otherwise the ADD command may pick up wrong data and produce an incorrect result, or block the pipeline completely. If that happens, we have to halt not only the operation that caused the problem, but the whole set of dependent operations already sent out. The main difficulty here is the need to quickly update the internal structures of the scheduler where operand status and dependency information is stored. Besides, the scheduler queues would have to be long enough to accommodate all commands already sent for execution as well as all commands that have arrived into the scheduler in their place. Since complex backoff mechanisms may cause problems at high clock frequencies, and the scheduler queues in the Pentium 4 processor are not long enough, Intel engineers went for a compromise and developed a solution known as replay.
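The trade-off can be illustrated with a small toy model of our own (all numbers here are illustrative assumptions, not Intel's figures): with conservative scheduling every L1 load effectively costs the L1 latency plus the scheduler-to-execution distance, while speculative scheduling costs only the L1 latency on a hit and pays for a miss with trips around a replay loop.

/* Toy model (our own illustration): effective load-to-use latency
   seen by a dependent ADD under two scheduling policies.            */
#include <stdio.h>

/* conservative: wait until the load result is confirmed ready       */
static int wait_for_result(int l1_latency, int sched_to_exec)
{
    return l1_latency + sched_to_exec;
}

/* speculative: dispatch assuming an L1 hit; on a miss the dependent
   op makes `passes` trips around a replay loop of `loop_len` clocks */
static int speculate(int l1_latency, int loop_len, int passes)
{
    return l1_latency + loop_len * passes;
}

int main(void)
{
    printf("wait policy, L1 hit : %d clocks\n", wait_for_result(2, 6));
    printf("speculate,   L1 hit : %d clocks\n", speculate(2, 7, 0));
    printf("speculate,   L1 miss: %d clocks\n", speculate(2, 7, 1));
    return 0;
}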

First Look

Instructions can be executed incorrectly for multiple reasons. Besides dependence on the results of previous instructions, there are external conditions such as an L1 cache miss, failed store-to-load forwarding, hidden data dependencies, etc.

Let's find out what the replay system looks like (Pic.3a, Pic.3b). The Scheduler output is connected to the Replay Mux, and from the Replay Mux operations are sent to two pipelines. The first pipeline is the main one; it delivers commands to the execution units. The second pipeline belongs to the replay system itself and consists of empty stages that do no specific work, but whose number up to the Check stage equals the number of stages in the main pipeline. The second pipeline receives exact copies of the operations travelling in parallel down the first pipeline.


Pic.3a


Pic.3b

The operations travel along both pipelines in parallel until they reach the Check stage. Here the Checker unit verifies whether the operation in the main pipeline has been executed correctly. If everything is alright, the operations retire (Pic.3a). If the result turns out to be incorrect for one reason or another (for example, an L1 Miss signal arrived), the chain of operations from the second pipeline is sent back to the Replay Mux through a replay loop (Pic.3b). At the same time (in the case of an L1 cache miss, for instance) a request is sent to the next caching level (the L2 cache). The replay loop may contain additional "fictitious" stages (in the picture they are marked as STG. E and STG. F). The number of these stages is chosen so that the delay of a complete trip around the loop is just enough for the data to arrive from the next cache level (for example, the additional L2 cache latency of 7 clock cycles).

By the time the returned command is expected to arrive at the replay mux, the Checker sends a special stop signal to the scheduler so that the scheduler reserves a free slot in the next clock. The replay mux inserts the command returned for re-execution into this slot. All commands that depend on the incorrectly executed operation are also returned for re-execution; note that the distance between these commands remains a fixed number of stages.

Here I have to stress right away that commands can be sent to replay multiple times. For instance, the data from the L2 cache may simply arrive too late when many repeated requests to the L2 cache occur. In this case one or two additional loops may be necessary, which increases the L2 read latency. For example, the data may arrive from the L2 cache in 9 clock cycles instead of 7, and an additional loop will add at least 7 clock cycles to that. If the data is not in the L2 cache at all, the chain of commands depending on this data will keep rotating in the replay system until the requested data arrives from main memory.

Additional replay loops are one of the reasons why the actual L2 cache latency may turn out to be much higher than the number claimed in the official documents.

“Holes”

Since the distance between the operations returned to the replay system always stays the same, empty unoccupied stages sometimes appear between them. The Checker never sends the stop signal to the scheduler for these; let's call them "holes". The scheduler can send out a command for execution if its dispatch clock coincides with a hole. This allows the processor to use its computational resources more efficiently, as commands from different dependency chains can be mixed together. Let's take a look at the following example:

LD R1, [X] // loading X into R1 register
ADD R1, R2 //1 – R1 = R1+R2
ADD R1, R2 //2 – R1 = R1+R2
ADD R1, R2 //3 – R1 = R1+R2
ADD R1, R2 //4 – R1 = R1+R2
ADD R1, R2 //5 – R1 = R1+R2
LD R3, [Y] // loading Y into R3 register
ADD R3, R4 //6 – R3 = R3+R4
ADD R3, R4 //7 – R3 = R3+R4
ADD R3, R4 //8 – R3 = R3+R4
...

We've got two chains of dependent commands: a chain of R1 register dependencies and a chain of R3 register dependencies. To simplify the example, suppose that commands of all types are sent to the same scheduler one by one; that is why the LD R3, [Y] command cannot be scheduled for execution before the fifth ADD R1, R2 command. Let's assume an L1 load latency of 2 clock cycles and an ADD latency of 1 clock cycle.

Here are two cases to be considered:


Pic.4a


Pic.4b

  1. X and Y values are in L1 cache (L1 hit, Pic.4a). No surprises here. No commands go to the replay system, all commands retire one by one.
  2. X is not in the L1 cache but is found in the L2 cache (L1 miss); Y is in the L1 cache (Pic.4b). This is a much more interesting case. Until the 6th clock cycle the scheduler sends commands out for execution according to their estimated latencies. The first LD command reaches the Check stage on the second pipeline at the 6th clock cycle, receives the L1 cache miss signal and is turned to replay (at the same time a read request for X is sent to the L2 cache). The next 5 ADD commands will not receive a correct operand and follow the first LD command into the replay system. By the time the first LD command reaches the replay mux, the scheduler will have already sent the second LD command after ADD5 in the 8th clock cycle, and at that moment it also receives the signal to leave one slot free for the first LD command in the next clock cycle. At the 9th clock cycle the first LD command reaches the replay mux and is re-sent. At the 10th clock cycle the replay loop delivers no command, so the scheduler fills the "hole" with the ADD6 command waiting for execution resources. In the following clock cycles the ADD1-ADD5 commands following the first LD are re-sent for execution. After the last ADD5 command the scheduler is able to fit ADD7 and ADD8 into the available slots. At the 14th clock cycle the first LD command finally receives the data from the L2 cache and is executed correctly, so the first LD and the following ADD1-ADD5 commands get executed and retire.

This example shows how the scheduler tries to use the computational resources more efficiently by exploiting the "hole" between the LD and ADD1 commands: it inserts the ADD6 command from the independent chain there. Unfortunately, there is another side to this hunt for efficiency. Let's talk about it now.

Infinite Replay Looping

In the previous example we discussed the situation when a command from an independent chain fits into the "hole" between dependent commands. Unfortunately, the scheduler of contemporary NetBurst processors is not "smart enough" to track the entire chain of dependent commands sent to replay. In other words, the scheduler always assumes that a dispatched operation has been executed correctly. Now let's take a closer look at the situation when a command from the same dependency chain fits into a "hole". This matter is very important for proper performance estimates, because indirect dependency chains are found in programs very frequently.

As the simplest example, consider a total sum calculation, where the result of every operation depends directly on the result of the previous one.

for (int i = 0; i < count; i++)
    sum += array[i];

Well, let’s analyze how the replay works with the help of this example of a long data dependency chain (see Pic.5 below).

LD R1, [X] // loading X into register R1
ADD R1, R2 //1 – R1 = R1+R2
ADD R1, R2 //2 – R1 = R1+R2
ADD R1, R2 //3 – R1 = R1+R2
ADD R1, R2 //4 – R1 = R1+R2
ADD R1, R2 //5 – R1 = R1+R2
ADD R1, R2 //6 – R1 = R1+R2
ADD R1, R2 //7 – R1 = R1+R2
ADD R1, R2 //8 – R1 = R1+R2
ADD R1, R2 //9 – R1 = R1+R2
  ………

Just like in the previous example (Pic.4b), we assume an L1 load latency of 2 clock cycles and an ADD latency of 1 clock cycle. X is not in the L1 cache (L1 miss) but is found in the L2 cache.


Pic.5

In the model on Pic.5 the scheduler sends the chain of commands to the pipeline without any problems, according to their estimated execution latencies. Just like in the example on Pic.4b, the LD command reaches the Hit/Miss stage on the first pipeline at the 6th clock cycle, where an L1 miss occurs, and its copy on the second pipeline reaches the Check stage, receives the L1 miss signal and is turned to replay. All the following ADD commands cannot be executed correctly, so they also go to replay. At the 8th clock cycle, as the LD command approaches the replay mux, the scheduler receives the signal to stop and leave one slot free for it, so the replay mux re-sends LD at the 9th clock cycle. After that, starting from the 10th clock cycle, the negative consequences of replay start popping up. The scheduler cannot track the entire dependency chain starting with LD, so at the 10th clock cycle it inserts ADD7 into the "hole" between LD and ADD1. Of course, ADD7 cannot be executed correctly before ADD6 is, but in the pipeline ADD7 now travels ahead of ADD6. That is why, while the LD operation retires at the 14th clock cycle after obtaining the correct data, ADD7 is turned to replay, and so are ADD8, ADD9 and all the commands that follow them. A big "hole" opens up between the ADD7 and ADD8 commands, other commands will fit into it, and so on and so forth. Moreover, these "holes" can also be created by commands from other chains which the scheduler dispatched while reordering commands.

So, we see the following picture: the scheduler doesn't know which commands have been executed correctly, and it keeps sending new commands from the dependency chain for execution, inserting them into the "holes" between commands sent to replay. Since not all the previous commands have been executed correctly, the commands in the "holes" will also be executed incorrectly and will have to go back to replay. All commands of this chain will keep rotating in the replay loop one after another, like links of a chain wound around a rod. If we draw a time diagram showing what is at the Dispatch stage at each given moment, it looks as follows:


Pic.6: The chain of commands getting into the replay system.

New commands will keep getting into the replay loop until the dependency chain ends. The negative influence of replay on the rest of the chain results not only in increased operation latency, but also in an utterly inefficient waste of computational resources, because every operation that enters the replay system is executed at least twice: once before and once after the replay. And as we have already pointed out, commands can be re-executed several times while waiting for the data to become ready, so the load on the resources may grow more than twofold.

In real applications the loops of dependent commands may disappear if some event causes a pipeline clear, or if the "holes" get "patched", i.e. enough commands from independent chains fit into the "holes" between the operations sent back to replay (as in the example on Pic.4b). However, it is extremely hard to predict from the program code when such "patches" might be needed, since the final decision about the execution order is up to the scheduler, which will not necessarily "patch the hole" at the right moment.

Moreover, as you already know, the Pentium 4 processor features five schedulers. Each scheduler has its own replay system, so commands of different types circle in different replay loops, which increases the number of "holes" in each of them and, in turn, the probability that dependent commands will start looping.

As we managed to find out, the "holes" between the commands re-sent for execution, which create long-lasting replay looping, are the major reason why a dependency chain of MOV EAX, [EAX] commands fails to produce correct cache latency measurements on the Pentium 4 processor. The existence of "holes" also explains the graph we showed at the very beginning of this article. It turns out that once commands from the dependency chain fall into the "holes", in the worst case they start looping more times than actually necessary, increasing the execution latency tremendously. The number of extra "loops" they make depends on the combination of "holes", load commands and the number of commands between the loads.
Our investigation relied on undocumented performance counters that count the operations sent to replay. Besides, we also developed special tests where we arranged the commands between the loads in such a way that the resulting "patches" would not let the commands from the dependency chain get into the "holes" between the replayed commands. Our results proved the theory: if the "holes" are "patched" in time with other commands, the chain is protected against looping, the program code executes much faster, and the measured cache latency matches the values claimed in the official documents.
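For illustration, here is a schematic reconstruction of the idea behind such a test (our own sketch, not the authors' actual code; it assumes GCC inline assembly on x86-64, and the registers and instruction counts are purely illustrative): in the "patched" variant, independent operations on a second register are placed right after the load, so that if the load replays, the scheduler fills the "holes" with them rather than with the dependent chain.

/* Our reconstruction of the idea; 'chain' points into a pointer-chasing
   buffer whose next element may miss L1 and hit L2. */
static void *unpatched(void *chain)
{
    __asm__ volatile (
        "mov (%0), %0 \n\t"   /* load; on an L1 miss it goes to replay     */
        "add $0, %0   \n\t"   /* dependent ADDs follow the load directly,  */
        "add $0, %0   \n\t"   /* so they can fall into the replay "holes"  */
        "add $0, %0   \n\t"
        : "+r"(chain));
    return chain;
}

static void *patched(void *chain, unsigned long pad)
{
    __asm__ volatile (
        "mov (%0), %0 \n\t"   /* load; may still miss L1 and replay        */
        "add $1, %1   \n\t"   /* independent ops on another register act   */
        "add $1, %1   \n\t"   /* as "patches": they occupy the "holes"     */
        "add $0, %0   \n\t"   /* the dependent chain resumes afterwards    */
        "add $0, %0   \n\t"
        : "+r"(chain), "+r"(pad));
    return chain;
}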

This way, the usual approach to calculating latencies (such as the latency of an L1 cache miss, for example) does not work for the NetBurst architecture. The actual number includes not only the latency itself, but also the additional delay caused by replay looping. The looping may last quite a long time, so the effective latency may reach hundreds of clocks instead of just 9. The worst thing, however, is that replay does not simply delay the execution of one instruction: it also ties up execution resources that could have been used for other, independent operations in the meantime.

Additional Replay Loops

An L1 cache miss is the major but not the only event causing replay. There is a whole group of other events, several of which (cache aliasing, DTLB misses, store-to-load-forwarding violations, Write Buffer conflicts) we will discuss below; even this list is far from complete.

These events occur at different pipeline stages, and the time required for the correct data to arrive also differs. To handle this, additional replay loops get involved.

When we were investigating how latencies change in the Pentium 4 Northwood processor, we discovered that some events (such as an L1 miss with L2 hit, or L1 64KB aliasing) cause a latency equal to the full replay loop length (7 clocks) multiplied by the number of passes, plus 2 clocks of L1D cache access latency. The latency caused by other events (such as an L1 hit with a DTLB miss, or L1 1MB aliasing) follows the same pattern with a 12-clock step, i.e. 2 + 12*n clocks. This indicates that there exists one more, longer replay loop. To simplify further explanations, let's call the replay loop with the 7-clock rotation cycle RL-7, and the replay loop with the 12-clock rotation cycle RL-12. Now let's find out what the difference between them is.
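The observed pattern can be summed up in a simple back-of-the-envelope model (our own summary of the measurements, with the loop length and the number of passes as parameters):

/* Our summary of the observed pattern: total latency of a load that is
   re-executed `passes` times around a replay loop of `loop_len` clocks,
   on top of the 2-clock L1D access. */
static int observed_latency(int loop_len, int passes)
{
    const int l1d_access = 2;            /* L1D cache access latency */
    return l1d_access + loop_len * passes;
}

/* Examples (Northwood):
     L1 miss, L2 hit          : observed_latency(7, 1)  ->  9 clocks
     L1 miss, late L2 data    : observed_latency(7, 2)  -> 16 clocks
     L1 hit, DTLB miss        : observed_latency(12, 1) -> 14 clocks  */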

Let's take a look at a fairly common situation: the line with the requested data is in L1, but there is no record in the DTLB for the memory page this line belongs to (L1 hit, DTLB miss).


Pic.7: Flow-chart illustrating the work of a replay system with two replay loops.

The chain of commands headed by LD moves along the pipeline. When it reaches CacheRead, the LD operation initiates a request to the L1 cache controller, and at the Hit/Miss stage it receives an L1 Hit signal, which means the tag for the requested line was found in L1. The parallel EarlyCheck stage receives the signal about successful LD execution, so it does not turn LD to re-execution and lets it continue down the pipeline together with the dependent commands following it. The L1 cache of the Northwood processor is designed in such a way that its tags are checked faster than the translation look-aside buffers (TLB), which translate virtual addresses into physical ones. Moreover, the number of TLB entries is considerably smaller than the number of lines in the L1D cache. That is why, if the record for the requested page cannot be found in the DTLB, the LD command gets a "nice" surprise: a few stages later, when LD reaches the LateCheck stage, the DTLB miss event takes place, and LD has to turn to replay, however disappointing that may be. All commands depending on the result of LD follow it into replay at the same LateCheck stage. While the commands are being turned to replay at the LateCheck stage, a force-early-replay-safe signal is sent to the EarlyCheck stage. Its major function is to prevent any incorrectly executed command from being turned to replay at the EarlyCheck stage, so that two commands cannot arrive at the replay mux simultaneously and cause a conflict. However, the fact that a command received a force-early-replay-safe signal does not at all mean that the command has been executed correctly; it only means that it will be sent to replay at the LateCheck stage instead. The LateChecker unit can handle the full set of replay events (including those handled by the EarlyChecker). Therefore, if a command should have been sent for re-execution at the EarlyCheck stage (for instance, it received an L1 miss signal together with the force-early-replay-safe signal), it will definitely be turned to replay at the LateCheck stage.

Here the following question arises: why do we need two replay loops, RL-7 and RL-12, working in parallel in each computational channel, if all commands can be turned to replay at RL-12? The answer is actually quite simple: the processor tries to perform speculative re-execution. If the LD command is turned to replay after L1 miss at an early stage, it will be re-executed sooner and the latency will be lower if the data is successfully found in the L2 cache. This way, RL-7 serves as an auxiliary loop reducing the latency in some cases.

In Prescott-based processors the loop structure has been slightly changed. Since the L1 cache latency grew to 4 clocks, the stages where the L1 miss and DTLB miss events can occur now coincide. And since the L2 cache latency grew to 18 clocks, there is no longer any point in re-executing the LD command earlier. That is why the Prescott processor core features only one type of replay loop: RL-18 (18 clocks per loop). Our tests confirm this.

Replay in the FPU Pipeline

The replay mechanism in the FPU pipeline works according to a different algorithm than the ALU replay. It looks like there is a sort of feedback between the data loading unit and the scheduler: only once the L1 data cache has been checked and the data has been found there does the scheduler send the dependent instruction further. So, if the data is reported missing from the L1 data cache (the case handled by the RL-7 loop for ALU loads), the FP load (the category x87, MMX, SSE and SSE2 loads belong to) is replayed, but the dependent instructions are not issued. For RL-12 events there is no difference: FP operations circle in the replay loop just like integer ones. If the data is found in the L1 cache, the latency of FP load operations is 9 clock cycles; if it is not there, n*7 or n*12 clock cycles are added, depending on the situation. In fact, we failed to send any chain of FP operations into RL-7 at all. For example, if an integer chain is circling around RL-7, a dependent FP chain will end up in RL-12. For instance, the instruction pair "MOVD MM0,EAX - MOVD EAX,MM0" transfers an integer chain from RL-7 to RL-12 (via the EAX dependency).

Why is it this way and not the other way around? We assume that most instructions going via FP Move actually pass through something like the K8's "Convert & Classify" unit, where the result is translated into a certain internal representation (formatting). Several observations from our tests support this hypothesis.

Perhaps most FP Move operations are nothing other than more or less fixed pairs of primitive commands like "load + convert" or "convert + store", where the "convert" part takes about 6-7 clock cycles. Coming back to replay: in this (hypothetical) case the time required to execute the "convert" part exceeds the "distance" in clock cycles between the scheduler and the execution unit. So the scheduler can safely send the dependent operation onward based on the result of the first check. In case of failure, only the "load + convert" pair needs to be replayed.

STLF Violations

Those of you who have studied the NetBurst micro-architecture carefully know that the store operation is split into two quasi-independent micro-operations: Store Data (STD) and Store Address (STA). The results of these two micro-operations are combined in the Store Buffer (SB). The data stays in the SB until the store instruction retires; after that it is written to the cache and RAM via the intermediate Write Buffer, where the modified data is merged back together. Load commands can speculatively read data that is not yet in the cache directly from the SB. This process is called store-to-load forwarding (STLF).

In order for STLF to complete successfully, certain conditions must be fulfilled: in particular, the STA and STD results must already be available in the Store Buffer by the time the load executes, and the loaded data must be fully contained within the stored data (matching address and size).

Violating any of these conditions may lead to very unpleasant delays. The re-execution of a load command that could not complete successfully because of an STLF violation is also carried out via the replay system.

Well, in order for the store to be able to forward data to a load command, the SB should have the STA and STD results ready in advance. Although the "IA-32 Intel Architecture Optimization Reference Manual" classifies this condition under "Store Forwarding Restrictions", we should realize that it requires specific handling and leads to specific consequences. A true STLF violation, when data already located in the SB cannot be forwarded, results in a significant delay: the store and all preceding instructions must retire first, and the store result must be saved in the cache. In our case, i.e. an STLF Restriction on Data Availability, we only have to wait for the STA/STD result. As you may have already guessed, replay works here: the LD and all dependent instructions are sent to the replay loop and circle there until the store results arrive.

We have just examined two types of STLF violations: when, by the time the load should execute, either the STA or the STD result is not ready yet (the third type, when neither result is ready, is determined by the worse consequences of the first two).

We paid special attention to them because, unlike data-size mismatches, they are really hard to foresee in the program code. During the tests we carried out for STLF violations, we discovered that they can cause re-execution through RL-7 as well as through RL-12, depending on the type of violation: when STD is not ready, RL-7 kicks in, and when STA is not ready, RL-12. In fact, this is fairly simple: when executing the load command, we search for the data in the Store Buffer; once an entry with a matching address is found, a possible absence of data is detected immediately. However, if the data is there but the address has not arrived yet, the CPU can only hope for the best and assume that this store will not affect the load. Later it performs a check and, in case of failure, the operation is sent to RL-12.

In order to ensure that the LD is executed correctly from the very beginning, it should be dispatched no earlier than the STA and STD. Note that it should actually be dispatched at least 3 clocks after the STA. The processor will not allow the load to execute ahead of the STA; however, 3 clock cycles is a fairly big gap, and the load queue scheduler will try to take advantage of it, because it knows nothing about the hidden dependence of the addresses on the store.

The Pentium 4 architecture provides excellent opportunities for early data loading: this is where the address operation queues come in handy. Therefore, you cannot guarantee that STLF violations will be avoided even if you carefully adjust the algorithm implementation. Any pair of dependent store-load operations within a "window" of a few dozen instructions is a potential hazard.

One of the most illustrative examples here is the function call found in all program code. A function call is usually performed by a chain of PUSH commands storing parameters on the stack, followed by the actual CALL command.
…. 
PUSH EAX
CALL Func 
………
Inside the function, the registers are then saved and the parameters are read:

PUSH ESI 
MOV EAX, [ESP+8] 
………
You can see that the PUSH EAX and MOV EAX, [ESP+8] pair is a potential cause of replay due to an STLF Restriction on Data Availability: either when STD is not ready (the procedure receives the result of a long calculation) or when STA is not ready (if the scheduler dispatches MOV EAX,[ESP+8] less than 3 clocks after PUSH EAX).
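As a hedged illustration of one way around this particular hazard (our own suggestion following from the above, not something the official documentation prescribes): if an argument travels in a register instead of going through the stack, the PUSH/MOV store-load pair disappears and so does the replay risk tied to it. The sketch below assumes GCC targeting 32-bit x86 (compile with -m32), where regparm() forces register passing; the function names are hypothetical.

/* Sketch: a stack-passed argument goes through a PUSH (STA+STD) and a
   dependent load inside the callee; a register-passed argument does not. */
#include <stdio.h>

/* stack version: the caller pushes the argument, the callee reads it
   back from [ESP+...] -- the store-load pair that may replay           */
static int __attribute__((noinline)) sum_stack(int a, int b)
{
    return a + b;
}

/* register version: arguments arrive in EAX/EDX, so there is no store
   to the stack and no dependent load, hence no STLF hazard for them    */
static int __attribute__((noinline, regparm(2))) sum_reg(int a, int b)
{
    return a + b;
}

int main(void)
{
    printf("%d %d\n", sum_stack(2, 3), sum_reg(2, 3));
    return 0;
}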

Livelocks

Here we would like to dwell on one more, not very obvious, problem caused by replay. When we discussed the replay working principles, we mentioned several times that the scheduler pauses when a command turned to replay approaches the replay mux. It is quite possible that the entire chain of commands sent by the scheduler to the pipeline before the pause will be turned to replay. In this case the whole chain of re-executed commands will keep circling in the replay system until the requested data arrives. And have you ever asked yourself what happens if the instruction that must execute for the whole chain currently in replay to succeed is still in the scheduler and cannot be released because the pipeline is full? This situation is called a livelock. An example is the case when STD is not ready during store-to-load forwarding. Here is what it looks like in program code:

IMUL EAX            // long-latency multiplication; its result is the store data
MOV [ESI], EAX      // store: split into STA and STD micro-operations
MOV EBX, [ESI]      // load from the same address: needs STLF from the store above
14*{ AND EBX, EBX } // and ebx, ebx, repeated 14 times: dependent chain filling the replay loop

The store-to-memory command is split into the quasi-independent micro-operations STA and STD. These micro-operations go to different schedulers: STA goes to the same scheduler as the LD command, and STD goes to the same scheduler as the AND commands. This distribution of micro-operations is a perfect setup for a livelock. This is how it happens:


Pic.8: Livelock

Of course, the developers didn't intend to design a processor that could be locked dead by a particular chain of commands: this problem had to be solved. Indeed, our experiments show that in examples such as the one above, code execution resumes within a few dozen clock cycles. We have to confess that we do not know all the details of the pipeline clearing mechanisms, but one possible solution is discussed in detail in the next chapter.

Forced Replay Quit Mechanism

As we have just seen, special additional logic capable of resolving hopeless conflicts like this is needed if the processor is to keep working at least at a minimally acceptable level. To be more exact, we need two mechanisms: a "watching" mechanism that detects problems like livelocks, and an "acting" mechanism that pushes the system out of trouble. If these mechanisms exist, they should evidently work not only in completely hopeless situations such as livelocks, but also in more neutral cases. Moving from theoretical discussion to practical tests, we decided to probe the system with a chain containing a "hole", gradually increasing the number of commands in it.

        {xor eax,eax}
256* {and eax,eax}
        {mov eax,[eax+esi]}     //   L1 miss, L2 hit
        {and eax,eax}
        {add eax,eax}  //   “hole”
   N* {and eax,eax}

Let's take a look at the diagram showing the difference (in clock cycles) between the execution latencies of the "L1 miss, L2 hit" and "L1 hit" cases depending on the chain length (Pic.9a). We can see that the system behaves quite predictably while the chain of commands is relatively short: our rough latency estimates are close to the theoretical values. As we increase the number of instructions in the dependent chain of AND commands, we increase the number of replay loops: with every 14 commands added to the chain, the latency jumps up by 7 clocks. But when we reach N=82 we suddenly see an unexpected delay (of about 22 clock cycles), which is hard to explain at first sight. To figure out what caused this delay we turned to the Pentium 4 performance counters, which include a special group of counters tracking replay events. Analyzing the counters that measure the number of instructions sent to replay, we found that starting with N=87 instructions stop going to replay (Pic.9b).


Pic.9a: Relative execution time for the chain of commands
with one “hole” depending on the length of this chain.


Pic.9b: The number of replayed commands from the chain
with a “hole” depending on the length of this chain.

The delay should be somehow connected with the forced replay exit. But what mechanism stops the chain?

Suppose the process stalls naturally, i.e. without any external influence, for example because some internal processor resources get exhausted. The question then is what these resources are, because there is quite a large reserve of the main resources (ROB, internal register pool, etc.), enough to let the CPU survive much heavier stress. We will not completely dismiss this possibility yet, but will keep in mind that it cannot be the solution for the livelock: that problem would still have to be solved in some other way.

Let's approach the issue from the other end. The simplest and most radical way of resolving the situation is to clear the pipeline. In that case we would expect the counters to register the replayed chain of instructions again immediately; however, the experiments show nothing of the kind. Nothing to be surprised at, in fact: clearing the pipeline would be too rough a measure here.

If we consider our chain of commands with "holes", the solution could be quite simple: prevent the scheduler queues from sending out new commands for a while. This is a fairly natural solution, because the issue arises precisely from the fact that the scheduler has no idea what is going on in the replay loop and keeps sending new instructions into it. However, this method is not exactly the best way to resolve livelocks: it is just the opposite of what should be done in that case. All we need is to somehow fit the instruction stuck in the scheduler into the replay loop, whereas all the measures discussed above result in its complete isolation. Besides, we can also see that as the chain keeps growing, at least 5 additional instructions still make it into the replay system, which does not fit this assumption.

Well, there is one more option left to check: locking the scheduler input. Since revealing all the smallest details of this would take another huge article, we will only provide the test results and the conclusions drawn from them.

We found that at a certain moment only 8 more instructions can be sent for execution (and subsequently replayed), which exactly matches the depth of the schQ FAST_0 scheduler queue.

The conclusion is evident: we are dealing with the locking of the scheduler input. This is a very important observation, as this mechanism is a logical basis for a unified, universal system used to resolve livelocks as well as "long chain" issues. The tests show that once the scheduler input is closed, it locks not only the schQ FAST_0 queue but all the other schedulers as well: we can see this from the limited number of instructions getting into replay from those schedulers once their input has been locked. The price we pay for this stall operation can be roughly estimated at 15-35 clock cycles.

The system appears to watch for a while the proportions of replayed instructions, new instructions and free slots (let's call these patterns). However, what we know so far does not allow us to fully reproduce the exact decision-making algorithm behind the stall. Our experiments show that a serializing event serves as the starting point, which is why, if we take the test code of our example as-is but slightly modify the calling procedure, the maximum number of replayed instructions may turn out different.

Note that long chains are not stalled globally. For example, a chain of dependent shifts will never be stopped, and the same is true for many other instructions. In other words, not stopping is the rule rather than the exception. Our model "chains with holes" are always stopped when they get into RL-7, but in some cases, when they get into RL-12, the "holes" keep looping indefinitely. The looping of our model chains makes the pipeline state repeat identically every 14 clocks, which cannot happen in real-life program code. So we get the impression that we are dealing not with a mechanism designed to resolve this particular difficult situation, but with a system originally intended for other tasks, which gets involved here only under specific favorable conditions.

We believe that this mechanism is intended primarily for resolving livelocks. If you have been reading our article carefully, you might ask: how can blocking the scheduler input help get out of a livelock? We think it works like this: when the scheduler input is closed, the chain of commands circling in the replay loop is moved into some special buffer, and the freed positions in the pipeline get occupied by the micro-operations left in the scheduler. These micro-operations may then execute successfully, resolving the livelock.

Recording Data into Write Buffer

The story here is fairly amusing: it all started with our discovery of completely new replay loops that appeared out of nowhere. Of course, we had to find out where they actually come from and why. At first the investigation went really slowly, as the loops often emerged and then disappeared for no apparent reason. Our detailed investigation eventually revealed a connection with writing data into the Write Buffer. Let's discuss a few details.

The L1 cache of the Pentium 4 processor uses a write-through update policy, which implies that the contents of both the L1 and L2 caches are updated on a write. The actual write therefore goes to the L2 cache, while the results of multiple store operations are first accumulated in the Write Buffer (WB), so that the transfer to the L2 cache can be performed in the minimal number of transactions. When a store is made to a line that was not yet present in the WB, that line gets blocked for reading for a while.

For reads from the Write Buffer to be blocked, a few conditions must be fulfilled: the store must go to a line that was not yet present in the Write Buffer, and the load must access that same line soon after the store.

Note that the addresses of the read and write operations may differ: they only have to belong to the same line. According to our test results, if we read from the first half of the line (bytes 0-31), the load command goes to replay if it follows the store within 21-33 clock cycles; if we read from the second half of the line (bytes 32-63), the load goes to replay if it follows the store within 21-34 clock cycles. The mere one-clock difference between the lower and upper 256 bits of the line indicates that during this period the modified data is being merged with the unmodified data from the cache.

The Pentium 4 Northwood processor features a Write Buffer with 6 lines of 64 bytes each. That is not much, so if read and write operations come close to one another, we experience more than just write delays: it also becomes impossible to read the freshly stored data from the Write Buffer immediately, which causes additional delays and sends the read commands through the replay system.

Replay Influence on Hyper-Threading

The major goal of Hyper-Threading technology is to increase the efficiency of computational resource usage: the two threads share no data dependencies, so one thread can use the resources not occupied by the other, especially when the latter is idle. Processors spend quite a lot of time waiting for data to arrive from RAM, and during these waits the execution resources can sit idle for a long while. So why not let the other thread use them if it is not waiting for data from system RAM itself?

As we have already described above, a command looping in the replay system for a long time may cause an unjustified waste of computational resources. For instance, if the data is in neither the L1 nor the L2 cache, the chain of commands may make tens or even hundreds of trips around the replay system, occupying computational resources for nothing while waiting for the data to arrive from main memory. If a NetBurst processor runs a single thread, this looping while waiting for data from RAM hardly causes any additional performance loss, because the CPU has to wait for this data anyway, losing hundreds of clock cycles (processing stops until the data arrives).

In this case the extra activity of the execution units affects the processor's heat dissipation more than its performance :) But when two threads are being processed simultaneously, the inefficient use of computational resources by one thread in the replay system cannot go unnoticed by the other. We dare suppose that the more often a thread requests data absent from the L1 and L2 caches, the more resources its replay activity eats up while waiting for the data to be delivered.

We decided to check whether this theory is true. We wrote a program where one thread has a long chain of data-dependent commands and constantly requests data from system memory at random addresses, while the other thread simply carries out calculations in registers, hardly touching memory at all. Both threads execute the same type of commands (AND) on the same FastALU0. The main goal of this experiment was to check how the performance of the second thread, which does not work with memory, changes depending on where the data requested by the first thread resides: in the L1 cache, in the L2 cache or in system RAM. The results of our tests for the Pentium 4 Northwood processor are given on Pic.10.
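A simplified reconstruction of this experiment is sketched below (our own approximation, not the original test code; names and constants are ours, and the pinning of the two threads to the two logical processors of one physical core, e.g. via sched_setaffinity, is omitted for brevity). Thread 1 chases pseudo-random pointers in a buffer whose size determines whether its loads hit L1, L2 or RAM; Thread 2 does register-only work, and we report its throughput while Thread 1 is running.

/* Build: gcc -O2 -pthread ht_replay.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define CHASES (50UL * 1000 * 1000)      /* dependent loads done by Thread 1 */

static atomic_int stop;

static void *memory_thread(void *arg)    /* Thread 1: dependent random loads */
{
    void **p = arg;
    for (unsigned long i = 0; i < CHASES; i++)
        p = *(void ***)p;                /* mov eax,[eax]-style pointer chase */
    return p;
}

static void *compute_thread(void *arg)   /* Thread 2: register-only ALU work */
{
    uint64_t iters = 0;
    unsigned x = 1;
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        x = (x ^ (unsigned)iters) & 0x7fffffffu;   /* stand-in for AND chain */
        iters++;
    }
    *(uint64_t *)arg = iters;
    return (void *)(uintptr_t)x;         /* keep x alive */
}

int main(void)
{
    size_t bytes = 8u << 20;             /* vary 8KB..8MB to move between L1/L2/RAM */
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(bytes);
    for (size_t i = 0; i < n; i++) buf[i] = &buf[i];
    for (size_t i = n - 1; i > 0; i--) { /* Sattolo shuffle: one big cycle,  */
        size_t j = rand() % i;           /* so hardware prefetch cannot help */
        void *t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    uint64_t iters = 0;
    pthread_t t1, t2;
    uint64_t c0 = __rdtsc();
    pthread_create(&t2, NULL, compute_thread, &iters);
    pthread_create(&t1, NULL, memory_thread, buf);
    pthread_join(t1, NULL);
    uint64_t c1 = __rdtsc();
    atomic_store(&stop, 1);
    pthread_join(t2, NULL);

    printf("Thread 2 throughput: %.4f iterations/clock\n",
           (double)iters / (double)(c1 - c0));
    free(buf);
    return 0;
}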


Pic.10: Replay influence on Hyper-Threading (Northwood CPU).

On the picture above (Pic.10) you can see the dependence of the second thread performance (Thread 2) on the size of the data buffer of the first thread (Thread 1), which requests data at pseudo-random addresses.

The results are more than illustrative. While one thread is waiting for data to arrive from memory, it slows down the processing speed of the second thread (by more than 35% compared with the situation when the data is expected from the L1 cache). The thread waiting for data from RAM occupies the resources even more heavily during the wait than it would during regular execution with the data available in the L1 cache. The situation with HT is aggravated by the fact that the two threads share the L1 and L2 cache capacity, which effectively halves the cache available to each thread. This in turn increases the number of cache misses, and with it the number of replay cases, which lowers the performance of both threads. Replay could be one of the reasons why enabling HT turns out harmful for performance in certain tasks.

Now that we have figured out what happens with the results demonstrated by the Pentium 4 on the Northwood core, we decided to test a Pentium 4 processor on the new Prescott core, especially since Intel claimed to have enhanced HT technology there. The test results did not disappoint us (see the influence of the cache misses and replay cases of one thread on the performance of the other):


Pic.11: Replay influence on Hyper-Threading (Prescott CPU).

The influence of the replay system on performance did not just get smaller: it changed completely. First, thread performance is now always higher with HT technology enabled. Second, if the data is in neither the L1 nor the L2 cache, the performance of the second thread turns out somewhat higher than when the data is available in the L2 cache.

Replay Queues

Why is that? Since the official manuals reveal no details about the reasons behind the observed phenomena, we had to search through Intel patents ourselves. We found a very interesting patent called "Multi-threading for a processor utilizing a replay queue", which seems to shed some light on the results we keep getting. Let's take a brief look at the replay queue and its main working principles.


Pic.12: Flow-chart illustrating the work of the replay system with Replay Queues.

In the system presented on Pic.12, commands turned to replay from the Check stage fall into the replay queue loading controller, which decides the further fate of the commands coming to replay. If a load command could not be executed correctly because of an L1 cache miss, it still has a chance of getting the data from the L2 cache, so the controller returns this command to the replay mux through the replay loop. If an L2 cache miss occurs during the re-execution attempt, the replay queue loading controller receives the L2 Miss signal and sends this command into the replay queue, to prevent it from wasting execution resources while waiting the hundreds of clocks it takes the requested data to arrive from RAM. The controller also sends there all incorrectly executed commands that follow the first one in program order, since hidden data dependencies may have been what caused them to be turned to replay. Since a CPU with Hyper-Threading technology can process two independent threads at a time, there have to be two independent queues for the commands of these two threads, which are processed in parallel.

Commands are re-sent from the replay queue for re-execution only when the replay queue unloading controller receives a Data Return signal, meaning that the data has arrived from system RAM. The replay queue unloading controller then releases the commands from both queues in the hope that they will execute successfully. The choice of which replay mux input supplies the next command (one of the replay queues, the replay loop or the scheduler) is left to a priority system. Different priority schemes are possible: fixed (thread 0 first, then thread 1, then the replay loop and finally the scheduler) or age-based (priorities are assigned to instructions according to the order in which they arrived at the scheduler). Unfortunately, we do not know for sure which scheme is used in Prescott processors.

Besides preventing a chain of commands waiting for data from RAM from wasting computational resources, replay queues can also serve to combat the livelocks we discussed above. We assume that a buffer of this kind may be what receives the commands that get stuck in the replay system of the Northwood processor.

It is quite possible that the Prescott processor implements exactly the replay queue system described in the "Multi-threading for a processor utilizing a replay queue" patent; at least the significant difference in test results compared to Northwood suggests so. However, we cannot help pointing out that the performance of the second thread on Prescott still drops by about 20% if the first thread, working with commands of the same type, suffers a lot of L1 cache misses.

Conclusion

Summing up, replay is not an independent feature of the NetBurst architecture intended to increase processor performance. Replay should rather be regarded as the other side of the longer pipeline, an auxiliary mechanism intended to clean up after overly aggressive speculative scheduling. The performance drop caused by replay is the price we pay for high clock frequency. That is probably why the replay system is barely mentioned in Intel's official manuals and documents. Yet replay very often causes an unjustified waste of resources and significant performance drops.

Since the official documentation contains no description of replay's causes and consequences, many software developers simply have no idea that it exists and thus cannot optimize their programs for it. It is definitely good news that Prescott processors acquired replay queues, reducing the negative influence of replay on two parallel threads when HT technology is enabled; however, even in this case we see a roughly 20% performance drop in the second thread caused by the replay of the first one.