

Pages: [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 ]

What Do We Need the Replay For?

The main feature of the NetBurst pipeline distinguishing it from the Intel P6 and AMD K7/K8 pipelines is the appearance of a few additional stages between the scheduler and the execution unit.

Pic. 2a: Flow-chart for a pipeline segment from P6 processor core.
Pic. 2b: Flow-chart for a pipeline segment from NetBurst processor core.

Let’s discuss how the NetBurst processor pipeline operates in the segment between the scheduler and the execution unit. From here on we will use a simplified schematic representation of the pipeline with a few intermediate stages omitted.

The main task of the processor scheduler is to send commands for execution so that the execution units are always busy. The scheduler should time each command so that, by the moment it arrives at the execution unit, all of its operands have already been calculated. In the NetBurst architecture, a command takes more processor clock cycles to travel from the scheduler to the execution unit than any simple operation takes to execute. Therefore, the next operation has to depart for the execution unit before the previous operation has been processed and its result is ready. If operations are not sent in advance, the execution units stay idle while each dependent command travels down the pipeline, and processing becomes inefficient.
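The effect of sending dependent operations in advance can be illustrated with a minimal sketch. It is a toy model, not a description of the real hardware: the function name `chain_cycles`, the distance of 6 cycles, and the 1-cycle latency are assumptions chosen only for illustration, and the model assumes the scheduler learns about a finished result instantly.

```python
def chain_cycles(n_ops, distance, latency, dispatch_early):
    """Cycles needed to finish a chain of n_ops dependent operations.

    distance: cycles a command travels from the scheduler to the
              execution unit (hypothetical value).
    latency:  cycles the execution unit needs per operation.
    """
    dispatch = 0   # cycle at which the current operation leaves the scheduler
    finish = 0     # cycle at which the current operation's result is ready
    for _ in range(n_ops):
        finish = dispatch + distance + latency
        if dispatch_early:
            # The next operation departs so that it arrives exactly
            # when this result becomes available.
            dispatch = finish - distance
        else:
            # The next operation departs only after this result is ready.
            dispatch = finish
    return finish


early = chain_cycles(10, distance=6, latency=1, dispatch_early=True)
late = chain_cycles(10, distance=6, latency=1, dispatch_early=False)
print(early, late)
```

With these assumed numbers, each operation in the late case pays the full travel distance again, while in the early case the travel time is paid only once for the whole chain.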

In order to correctly calculate when the next operation needs to be sent out, the scheduler has to predict when its data is going to be ready. The prediction should be based on the time it takes to complete all previous operations whose results serve as operands for the current instruction. When the execution time of an operation is fixed (i.e. known in advance), the scheduling task is easy to solve. However, there are a lot of instructions whose execution time is not known in advance. Loading from memory (LD) is one example. The time it takes to complete this operation depends on where the data is located in the cache hierarchy or the memory subsystem, so the LD command may take from two to a few hundred clock cycles to load the data.

Theoretically, the easiest way to schedule commands with unknown execution time is to assume the worst-case latency. However, for loading from memory this number can reach hundreds of clock cycles (if we have to address the system RAM). As an alternative, we can simply make the scheduler hold a dependent instruction, for example an ADD that uses the result of the LD command, until the data has actually arrived. However, in a processor with a long pipeline this method is not very efficient, because the effective execution time of a load from L1 then becomes the L1 cache latency plus the distance to the scheduler in processor clock cycles (in the example above this is the distance from the Scheduler to the ALU_Oper).
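The two simple policies discussed above can be compared with a short sketch. This is a toy model under stated assumptions: the function name `add_start_cycle`, the policy labels, the distance of 6 cycles, and the worst-case latency of 300 cycles are hypothetical; only the 2-cycle best case comes from the text. Cycle 0 is the moment the LD starts executing.

```python
def add_start_cycle(policy, actual_latency, distance=6, worst_latency=300):
    """Cycle at which the dependent ADD can start executing.

    policy "worst": the ADD is timed to arrive after the worst-case
                    load latency, whatever the real latency is.
    policy "hold":  the ADD leaves the scheduler only once the data has
                    arrived, then still has to travel to the execution unit.
    """
    if policy == "worst":
        return worst_latency
    if policy == "hold":
        return actual_latency + distance
    raise ValueError(f"unknown policy: {policy}")


l1_hit = 2  # best-case load latency in cycles, as quoted in the text
for policy in ("worst", "hold"):
    print(policy, add_start_cycle(policy, actual_latency=l1_hit))
```

In this model an ideal scheduler would start the ADD at cycle 2 on an L1 hit. The "hold" policy adds the full scheduler-to-execution distance on top of that, and the "worst" policy wastes almost the entire worst-case interval.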

In order to ensure that all commands are processed efficiently, we should send the ADD command that depends on the result of the LD at the presumably best moment, and this assumption should be based on the best-case latency estimate. At the same time we need some backoff mechanism for the case when an L1 miss occurs. Otherwise, the ADD command may receive wrong data and produce an incorrect result, or block the pipeline completely. If a miss happens, we have to halt not only the operation that caused the problem, but also the whole set of dependent operations that have already been sent. The main difficulty here is the need to quickly change the internal structures of the scheduler, where the operand status and dependency details are stored. Besides, the scheduler queues would have to be long enough to hold all commands already sent for execution as well as all commands that have since arrived in the scheduler to take their place. Since complex backoff mechanisms may cause operational problems at high working frequencies, and the scheduler queues in the Pentium 4 processor are not long enough, Intel engineers went for a compromise and developed a solution known as Replay.
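The idea of optimistic scheduling with a fallback can be sketched as follows. This is only an illustration of the general principle, not a model of Intel's actual Replay logic: the function name `speculative_add` and the retry interval of 7 cycles are assumptions, and the fallback here is simply "try again a fixed number of cycles later". Cycle 0 is the moment the LD starts executing.

```python
def speculative_add(actual_latency, l1_latency=2, retry_interval=7):
    """Return (cycle at which ADD executes with valid data, wasted attempts).

    The ADD is timed to reach the execution unit exactly when an L1 hit
    would deliver the data. If the data is not there yet, the attempt is
    wasted and the ADD is tried again retry_interval cycles later
    (a hypothetical fallback policy).
    """
    arrival = l1_latency
    wasted = 0
    while arrival < actual_latency:
        wasted += 1                # the operand was not ready at arrival
        arrival += retry_interval  # hypothetical delay before the next attempt
    return arrival, wasted


print(speculative_add(actual_latency=2))   # L1 hit: no wasted attempts
print(speculative_add(actual_latency=18))  # longer latency: several retries
```

In this model an L1 hit costs nothing extra, while a miss costs some wasted execution attempts plus a delay that depends on how the retry interval lines up with the real latency.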


