Sideband Stack Optimizer
The decoder of the new K10 processors has acquired a special unit called the Sideband Stack Optimizer. It works on a principle similar to that of the Stack Pointer Tracker unit used in Core processors. Why is it needed? On x86, the CALL, RET, PUSH and POP instructions are used to call a function, return from a function, pass function parameters and save register contents. All of these instructions use the ESP register, which holds the current stack pointer, implicitly rather than as an explicit operand. When a function is called on a K8 processor, you can follow the execution of these instructions by representing their decoding as a sequence of equivalent elementary operations: modifications of the stack register plus loads and stores.
Instructions | Equivalent operations
// func(x, y, z); |
push esi | sub esp, 4; mov [esp], esi
add esp, 12 | add esp, 12
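The decomposition shown in the table can be sketched in Python. This is a simplified model, not the actual K8 decoder: instructions are plain strings, only PUSH, POP, CALL and RET are expanded, and the names next_eip (return address) and tmp (a temporary), as well as the operands X, Y and Z, are hypothetical. The helper esp_writes counts the elementary operations that modify ESP; each of them must wait for the previous one.

```python
# Simplified model (assumption): instructions are strings such as "push esi".
# Only PUSH, POP, CALL and RET are expanded; everything else passes through.

def expand(instr: str) -> list[str]:
    """Return the K8-style elementary operations for one instruction."""
    op, _, arg = instr.partition(" ")
    if op == "push":
        return ["sub esp, 4", f"mov [esp], {arg}"]
    if op == "pop":
        return [f"mov {arg}, [esp]", "add esp, 4"]
    if op == "call":
        # next_eip is a hypothetical name for the return address
        return ["sub esp, 4", "mov [esp], next_eip", f"jmp {arg}"]
    if op == "ret":
        # tmp is a hypothetical temporary holding the return address
        return ["mov tmp, [esp]", "add esp, 4", "jmp tmp"]
    return [instr]


def esp_writes(ops: list[str]) -> int:
    """Count operations that modify ESP; each depends on the one before it."""
    return sum(1 for o in ops if o.startswith(("sub esp", "add esp")))


# Hypothetical call sequence for func(x, y, z) followed by "push esi".
program = ["push X", "push Y", "push Z", "call func", "push esi"]
ops = [o for instr in program for o in expand(instr)]
for o in ops:
    print(o)
print("ESP modifications in the chain:", esp_writes(ops))
```

Every PUSH and the CALL contribute one "sub esp, 4", so the five stack instructions form a serial chain of five ESP updates.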
As this example shows, when the function is called, the instructions modify the ESP register one after another, so each instruction implicitly depends on the result of the previous one. Instructions in this chain cannot be reordered, which is why execution of the function body, starting with mov eax, [esp+16], cannot begin until the last PUSH instruction has executed. The Sideband Stack Optimizer tracks changes of the stack pointer and turns this dependent chain into independent operations: it adjusts the stack offset of each instruction and places sync-MOP operations (top-of-stack synchronization) in front of the instructions that access the stack register directly. As a result, stack operations no longer form a serial chain and can be reordered without restriction.
Instructions | Equivalent operations
// func(x, y, z); |
push X | mov [esp-4], X
push esi | mov [esp-20], esi
add esp, 12 | sync-MOP; add esp, 12
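The transformation performed by the Sideband Stack Optimizer can be modeled with a running offset. This is a minimal sketch under assumed simplifications, not AMD's implementation: it processes a straight-line trace, represents a sync-MOP as a text marker, and reuses the hypothetical names next_eip, tmp, X, Y and Z. PUSH, POP, CALL and RET no longer modify ESP; they address memory relative to the ESP value at the last synchronization point, and a sync-MOP is emitted only before an instruction that names ESP explicitly.

```python
# Minimal sketch (assumption): a straight-line trace of string instructions.
# delta = number of bytes ESP has moved since the last synchronization point.

def fmt(offset: int) -> str:
    """Format a stack address such as [esp-4] or [esp]."""
    return "[esp]" if offset == 0 else f"[esp{offset:+d}]"


def sync_mop(delta: int) -> str:
    """Text marker for the top-of-stack synchronization operation."""
    sign = "-" if delta < 0 else "+"
    return f"sync-MOP (esp {sign}= {abs(delta)})"


def optimize(program: list[str]) -> list[str]:
    out, delta = [], 0
    for instr in program:
        op, _, arg = instr.partition(" ")
        if op == "push":
            delta -= 4
            out.append(f"mov {fmt(delta)}, {arg}")
        elif op == "pop":
            out.append(f"mov {arg}, {fmt(delta)}")
            delta += 4
        elif op == "call":
            delta -= 4
            out.append(f"mov {fmt(delta)}, next_eip")  # hypothetical return-address name
            out.append(f"jmp {arg}")
        elif op == "ret":
            out.append(f"mov tmp, {fmt(delta)}")  # tmp: hypothetical temporary
            delta += 4
            out.append("jmp tmp")
        elif "esp" in instr:
            # Explicit use of ESP: bring the architectural ESP up to date first.
            if delta != 0:
                out.append(sync_mop(delta))
                delta = 0
            out.append(instr)
        else:
            out.append(instr)
    return out


program = ["push X", "push Y", "push Z", "call func", "push esi", "mov eax, [esp+16]"]
for line in optimize(program):
    print(line)
```

For this hypothetical trace the offsets come out as [esp-4] for X and [esp-20] for esi, as in the table, and the only ESP update left is the single sync-MOP in front of mov eax, [esp+16].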
In our example, the mov eax, [esp+16] instruction that begins the calculations in the function body now depends only on the sync-MOP operation, so it can execute in parallel with the instructions that precede it. Parameter passing and register saving therefore proceed faster, and the function body can start loading and using the parameters before all of them have been passed and before all registers have been saved.
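The shortened dependency chain can be measured with a rough model. The sketch below is an assumption-laden simplification: it counts ESP dependency steps rather than clock cycles, treats every operation that mentions esp as reading it, and treats sub esp, add esp and sync-MOP as writing it. The two lists are the K8-style expansion and the optimized form of the same hypothetical call sequence (parameters X, Y, Z, return address next_eip).

```python
# Rough model (assumption): count ESP dependency steps, not clock cycles.

def esp_depth(ops: list[str]) -> list[int]:
    """For each op, return the length of the ESP dependency chain ending at it."""
    depths, last_write = [], 0
    for op in ops:
        reads = "esp" in op  # any mention of esp is treated as a read
        writes = op.startswith(("sub esp", "add esp", "sync-MOP"))
        depth = last_write + 1 if reads else 1
        if writes:
            last_write = depth
        depths.append(depth)
    return depths


# K8-style expansion of the hypothetical sequence.
k8 = ["sub esp, 4", "mov [esp], X", "sub esp, 4", "mov [esp], Y",
      "sub esp, 4", "mov [esp], Z", "sub esp, 4", "mov [esp], next_eip",
      "jmp func", "sub esp, 4", "mov [esp], esi", "mov eax, [esp+16]"]

# The same sequence after stack-offset adjustment.
k10 = ["mov [esp-4], X", "mov [esp-8], Y", "mov [esp-12], Z",
       "mov [esp-16], next_eip", "jmp func", "mov [esp-20], esi",
       "sync-MOP (esp -= 20)", "mov eax, [esp+16]"]

print(esp_depth(k8)[-1], esp_depth(k10)[-1])  # 6 2
```

In this model the load that starts the function body sits six steps deep in the K8 sequence but only two steps deep (sync-MOP, then the load) after the transformation, while all the stores become independent of one another.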
Thus faster decoding of stack operations, the Sideband Stack Optimizer, a deeper return-address stack and successful prediction of indirect branches with alternating targets make K10 much more efficient at processing function-heavy code.
Unlike the Core 2 decoder, the K10 decoder cannot decode 4 instructions per clock even under favorable conditions, but it should not become a bottleneck during program execution. Average throughput hardly ever reaches 3 instructions per clock, so the K10 decoder is fast enough to keep the execution units' queues supplied with instructions and prevent the units from idling.
Instruction Control Unit
Decoded MOP triplets arrive at the Instruction Control Unit (ICU), which places them into the reorder buffer (ROB). The ROB consists of 24 lines of three MOPs each, and each MOP triplet is written into its own line. As a result, the ROB allows the control unit to track up to 72 MOPs (24 × 3) until they retire.
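The ROB organization described above can be modeled as a queue of fixed-width lines. In this sketch, which is an assumption and not AMD's design, each line holds one triplet of up to three MOPs, a partial triplet still occupies a whole line, and the MOP names are hypothetical strings.

```python
from collections import deque

ROB_LINES = 24      # lines in the reorder buffer (from the text)
MOPS_PER_LINE = 3   # one MOP triplet per line (from the text)


class ReorderBuffer:
    """Reorder buffer organized as lines, each holding one MOP triplet."""

    def __init__(self) -> None:
        self.lines = deque()  # each element is one triplet (list of MOP names)

    def insert_triplet(self, mops: list[str]) -> bool:
        """Write a triplet into its own line; return False if the ROB is full."""
        if not 1 <= len(mops) <= MOPS_PER_LINE:
            raise ValueError("a line holds one to three MOPs")
        if len(self.lines) == ROB_LINES:
            return False  # no free line: the ICU would have to stall
        self.lines.append(list(mops))
        return True

    def retire_line(self) -> list[str]:
        """Remove the oldest line (the caller ensures its MOPs have completed)."""
        return self.lines.popleft()

    def mops_in_flight(self) -> int:
        return sum(len(line) for line in self.lines)


rob = ReorderBuffer()
for i in range(25):
    accepted = rob.insert_triplet([f"mop{3 * i}", f"mop{3 * i + 1}", f"mop{3 * i + 2}"])
print(rob.mops_in_flight(), accepted)  # 72 False
```

Twenty-four full triplets fill the buffer with 72 MOPs; the twenty-fifth triplet is rejected until the oldest line is freed.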
From the reorder buffer, MOPs are dispatched to the scheduler queues of the integer and floating-point units in exactly the same order in which they left the decoder. MOP triplets stay in the ROB until all older operations have executed and retired. At retirement, the final values are written to the architectural registers and memory. Operations retire in the same program order in which they were placed into the ROB; as they retire, their data is removed from the ROB and their final values are saved. In-order retirement makes it possible to cancel the results of any younger operations that completed ahead of time if an exception or interrupt occurs.
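In-order retirement and the cancellation of early results can be illustrated with a small model. This is a hypothetical sketch, not the K10 retirement logic: each entry is a single MOP rather than a triplet, the fields dest, value, done and faulted are assumed names, and an exception simply discards the faulting MOP and everything younger.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Mop:
    dest: str               # architectural destination register (hypothetical)
    value: int              # result computed during execution
    done: bool = False      # execution has completed
    faulted: bool = False   # execution raised an exception


def retire(rob: deque, arch_regs: dict) -> list[str]:
    """Retire MOPs strictly from the oldest end of the ROB.

    A completed MOP commits its value only after every older MOP has retired.
    A faulting MOP cancels itself and every younger MOP, including results
    that were computed ahead of time.
    """
    retired = []
    while rob and rob[0].done:
        mop = rob.popleft()
        if mop.faulted:
            rob.clear()  # discard the results of all younger MOPs
            break
        arch_regs[mop.dest] = mop.value  # commit the final value
        retired.append(mop.dest)
    return retired


regs: dict[str, int] = {}
rob = deque([Mop("eax", 1, done=True), Mop("ebx", 2), Mop("ecx", 3, done=True)])
print(retire(rob, regs), regs)  # ecx finished early but must wait for ebx
rob[0].done = True
print(retire(rob, regs), regs)
```

The first call retires only eax, because ebx has not finished; ecx's early result stays in the buffer until ebx completes, and then both retire in program order.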