This way we achieve the maximum processing speed for the micro-operations: our execution units accept a new micro-operation every clock cycle. Ideally there is no idling at all in this case, and the performance of the whole structure is at its highest.
But let me stop here for a while and discuss in a bit more detail the release of commands “in advance”. Note that the scheduler sends a micro-operation for execution so that by the time it arrives at the corresponding unit all of its operands have already been calculated. Since it takes a few stages (clock cycles) for the operation to reach the unit, the scheduler has to estimate the readiness of operands a few clocks ahead. It must also take into account the time needed to execute the previous micro-operations, if their results serve as operands for the next ones. If an operation has a fixed latency (i.e. we know it in advance), the task is solved easily. However, there are certain instructions whose execution time cannot be predicted. For example, when we load some data from memory, the time we need to complete this operation depends on which level of the cache/memory hierarchy our data happens to be stored in.
This way, the scheduler splits all micro-operations into two groups: micro-operations with known execution time (fixed latency) and micro-operations with unknown execution time (variable latency).
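As an illustration, this split can be sketched as a small latency table in a scheduler model. The opcode names and cycle counts below are invented for the example, not taken from any real core:

```python
# Hypothetical latency table for a scheduler model.
# Fixed-latency ops map to a cycle count known at schedule time;
# variable-latency ops (e.g. memory loads) are marked None,
# because their completion time depends on the cache level hit.
LATENCY = {
    "ADD": 1,      # integer add: result ready next cycle (assumed)
    "MUL": 3,      # integer multiply: longer, but still known (assumed)
    "LOAD": None,  # unknown in advance: L1? L2? RAM?
}

def is_fixed_latency(opcode):
    """True if the scheduler can predict exactly when the result arrives."""
    return LATENCY.get(opcode) is not None

print(is_fixed_latency("ADD"))   # True
print(is_fixed_latency("LOAD"))  # False
```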
Of course, the first group of micro-operations holds no unpleasant surprises for the scheduler: if an ADD operation takes one clock cycle to execute, the result of the addition will already be available on the next clock. So the next, dependent operation can be sent to the execution unit on the very next clock, and our pipeline stays loaded in the most efficient manner.
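A toy model of this back-to-back issue is sketched below; the 1-cycle ADD and 3-cycle latency used for comparison are assumptions for the example, not figures from a specific processor:

```python
def issue_chain(latency, n_ops, start_cycle=0):
    """Return the cycle on which each of n_ops *dependent* ops issues,
    given that each result becomes available `latency` cycles after issue."""
    cycles = []
    cycle = start_cycle
    for _ in range(n_ops):
        cycles.append(cycle)
        cycle += latency  # the dependent op may issue once the result is ready
    return cycles

# Three dependent 1-cycle ADDs issue on consecutive cycles: no bubbles.
print(issue_chain(1, 3))  # [0, 1, 2]

# With a 3-cycle fixed-latency op, dependent issues are spaced 3 apart.
print(issue_chain(3, 3))  # [0, 3, 6]
```

Note that even in the second case the scheduler never stalls on a guess: it knows in advance exactly which cycle to pick.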
When we encounter a micro-operation of the second type, the scheduler has a few options that allow it to avoid halting the pipeline. Say, we have a command to load some data from memory.
First option (cautiously straightforward). Suppose we always plan for the worst plausible execution time. Here we do not consider such hopeless cases as waiting for the data to arrive from the swap file, which would take millions of clock cycles, or the data sitting very far away, say, in RAM, which would take hundreds of clock cycles. In our example the data will be in the L2 cache. This assumption may already look unreasonable to you: why, on earth, do we need the L1 data cache with its low latency if we never take advantage of that low latency? The strategy looks like a failure from the start, but let us still evaluate what it is going to cost us.
Ok, the data is in the L2 cache. Say the distance from the scheduler to the execution unit is 6 stages (which the micro-operation traverses in 6 clock cycles).
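Under these assumptions the scheduler's arithmetic for the conservative strategy can be sketched as follows. The 6 dispatch stages come from the text; the L2 load latency of 10 cycles is an illustrative number chosen for the sketch, not a figure from a real CPU:

```python
DISPATCH_STAGES = 6   # scheduler -> execution unit, per the example above
L2_LATENCY = 10       # assumed worst-case load-to-data latency (illustrative)

def dependent_issue_cycle(load_issue_cycle):
    """Cycle on which a dependent op must leave the scheduler so that it
    arrives at its execution unit exactly when the loaded data is ready."""
    # The load itself spends DISPATCH_STAGES cycles in flight, then
    # L2_LATENCY cycles fetching the data.
    data_ready = load_issue_cycle + DISPATCH_STAGES + L2_LATENCY
    # The dependent op also needs DISPATCH_STAGES cycles to travel,
    # so it must be released that many cycles before the data is ready.
    return data_ready - DISPATCH_STAGES

print(dependent_issue_cycle(0))  # 10 under these assumed numbers
```

The dispatch stages cancel out, so with this always-assume-L2 strategy the dependent operation is delayed by the full L2 latency even when the data was actually sitting in L1.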