Articles: CPU

Bookmark and Share

Pages: [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 ]

Forced Replay Quit Mechanism

As we have just seen, we need special additional logics capable of resolving hopeless conflicts like that, if we want to keep our processor working at least at the minimum acceptable level. To be more exact, we need two mechanisms. The first one, “watching mechanism”, will detect problems like livelocks, and the second one, “acting mechanism”, will push the system out of trouble. If these mechanisms exist, they should evidently work not only in completely hopeless situations, such as livelocks, but also in some more neutral cases. Well, moving from theoretical discussions to the practical tasks, we decided to test the system with the help of a chain with a “hole” by gradually increasing the number of commands in it.

        {xor eax,eax}
256* {and eax,eax}
        {mov eax,[eax+esi]}     //   L1 miss, L2 hit
        {and eax,eax}
        {add eax,eax}  //   “hole”
   N* {and eax,eax}

Let’s take a look at the diagram reflecting the difference (in clock cycles) between the execution latencies in case “L1 miss, L2 hit” and “L1 hit” depending on the queue length (Pic.9a). We can see that the system behaves pretty predictably if the chain of commands is relatively short: our rough estimates of the latencies are close to the theoretical values. While we are increasing the number of instructions in the dependency chain of AND commands, we increase the number of replay loops: after every 14 chains the chain latency jumps 7 clocks up. And when we reach N=82 we suddenly see an unexpected delay (for about 22 clock cycles), which we can hardly explain at first sight. To figure out what caused this delay we involved Pentium 4 performance counters, which include a special group of counters checking the replay events. We analyzed the performance counters calculating the  number of instructions sent to replay and found out starting with N=87 the instructions stop coming to replay (Pic.9b).

Pic.9a: Relative execution time for the chain of commands
with one “hole” depending on the length of this chain.

Pic.9b: The number of replayed commands from the chain
with a “hole” depending on the length of this chain.

The delay should be somehow connected with the forced replay exit. But what mechanism stops the chain?

Suppose that the process stalls naturally, i.e. without any external influence. For example, if the internal processor resources get overloaded. The question here is what these resources are, because there is quite a big reserve of the main resources (ROB, internal registers pull, etc.), which allows the CPU to survive much heavier stress. We will not completely disregard this possibility so far, but will keep in mind that it will not be the solution for the livelock: the problem will have to be solved in some other way in this case.

Let’s approach the issue from the other end. The simplest and most radical way of resolving the issue is to clear the pipeline. In this case we would expect the counters to detect the replayed chain of instructions immediately, however, the experiments didn’t show anything like that. In fact, nothing to be surprised at: clearing the pipeline is a pretty rough measure here.

Pages: [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 ]


Comments currently: 25
Discussion started: 06/08/05 05:25:05 AM
Latest comment: 08/25/06 02:33:35 PM

View comments

Add your Comment