Moreover, it is easy to show that the efficiency of this strategy generally declines as the pipeline grows longer. As you have just seen, with 6 stages between the scheduler and the execution units we got 8 clock cycles instead of 2. The resulting efficiency in this case is 25%.
For a pipeline with only one stage between the scheduler and the execution unit, the efficiency rises to 67% (3 clock cycles instead of 2).
For a pipeline with 666 stages we will get 668 clock cycles instead of 2. The efficiency is 0.3%.
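The three figures above follow one simple pattern, which can be sketched in a few lines of Python. This is only an illustration under the assumption that the dependent micro-operation effectively costs its own latency plus the scheduler-to-execution-unit distance; the function name `wait_strategy_efficiency` and its parameters are made up for this sketch.

```python
def wait_strategy_efficiency(latency: int, distance: int) -> float:
    """Efficiency of the 'wait for the result' strategy.

    latency  -- clocks the micro-operation needs in the ideal case (2 in the text)
    distance -- pipeline stages between the scheduler and the execution unit

    Assumption: the observed cost is latency + distance clocks,
    which matches the 8-instead-of-2 example for a 6-stage distance.
    """
    if latency <= 0 or distance < 0:
        raise ValueError("latency must be positive and distance non-negative")
    return latency / (latency + distance)


# The three pipeline lengths discussed in the text.
for distance in (1, 6, 666):
    efficiency = wait_strategy_efficiency(latency=2, distance=distance)
    print(f"{distance:>3} stages: {2 + distance} clocks instead of 2, "
          f"efficiency {efficiency:.1%}")
```

The longer the distance, the larger the share of time spent simply waiting for the micro-operation to travel down the pipeline.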
At the same time, if the instruction takes longer to execute, this strategy may actually work much better. Say, for instance, that our pipeline has 6 stages between the scheduler and the execution units, but the instruction in question takes 50-100 clock cycles to execute (depending on the circumstances). We do not know the exact execution time from the very beginning, however, but only after about 25 clocks.
The execution unit receives the micro-operation at  time point.
At  time point the execution unit learns that it will take 51 clock cycles to complete the operation.
At the same time point () the scheduler receives the same information. It waits for a while and …
At the same  time point it sends out the dependent micro-operation, which will reach the execution unit exactly at…
The  time point, when it suddenly finds the just obtained result of the previous micro-operation.
In other words, there are situations where the combination of the pipeline length, the micro-operation latency and the time at which this latency becomes known turns this strategy into something truly efficient.
This strategy is 100% efficient when [the distance between the scheduler and the functional units] is smaller than the difference between [the micro-operation latency] and [the time at which the latency becomes known]. In the example above, 6 is smaller than 51 - 25 = 26.
Integer operations do not satisfy this condition, which is why this strategy doesn't work for us here.
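The condition above is easy to check numerically. The following Python sketch is an illustration only: the function names are hypothetical, time is measured from the moment the execution unit receives the first micro-operation (a convention chosen for this sketch), and the integer case uses a latency of 2 clocks with the latency known immediately, which is an assumed best case.

```python
def latency_margin(distance: int, latency: int, known_after: int) -> int:
    """Clocks the scheduler can afford to wait after the latency becomes known.

    distance    -- stages between the scheduler and the execution unit
    latency     -- clocks the first micro-operation takes to execute
    known_after -- clocks after which the exact latency is known
    """
    return (latency - known_after) - distance


def is_fully_efficient(distance: int, latency: int, known_after: int) -> bool:
    """The condition from the text: distance < latency - known_after."""
    return latency_margin(distance, latency, known_after) > 0


def release_time(distance: int, latency: int) -> int:
    """When the dependent micro-operation must leave the scheduler so that
    it arrives exactly when the result is ready."""
    return latency - distance


# Long-latency example from the text: 6 stages, 51 clocks, known after 25.
print(is_fully_efficient(6, 51, 25))   # condition holds
print(latency_margin(6, 51, 25))       # clocks the scheduler waits
print(release_time(6, 51))             # release point of the dependent op

# Short operation (assumed: latency 2, known at once): the condition fails.
print(is_fully_efficient(6, 2, 0))
```

For the long-latency case the scheduler learns the latency early enough to time the dependent micro-operation perfectly; for a short operation the result is ready long before a dependent micro-operation could cover the distance.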
Third option (optimistic). From the performance point of view, the two previous options are not very interesting. The first option is awfully stupid, and the second is too inefficient. Only one option is left: to send instructions in advance, before we know the execution status of the previous micro-operations.
Let me describe this option in a bit more detail.
The commands can be released one after another, hoping for the best outcome of the data load. In our case this means that the next micro-operation should be sent 2 clock cycles after the data load micro-operation. How can we benefit from this strategy?
At the [0] time point we send the data load micro-operation to the execution unit. It should reach this unit at the [0+6] time point, and the scheduler knows this.
Without waiting for this time point, the scheduler releases the next micro-operation at the [0+2] time point (i.e. two clocks behind the previous command in the pipeline). What happens next? At the [0+6] time point the data load command reaches the execution unit. The next command, which depends on it, is 2 clocks behind. At the [0+6+2] time point the data load command receives its data from the cache and continues its trip down the pipeline, and the execution unit receives the second micro-operation just in time, exactly when the result is ready. So it turns out that the execution unit works two clocks in a row without pausing.
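The timeline just described can be reproduced with a small Python sketch. It is an illustration only: the function and key names are invented here, and it assumes the optimistic case in which the load actually gets its data from the cache in 2 clocks.

```python
def optimistic_timeline(release: int = 0, distance: int = 6,
                        load_latency: int = 2) -> dict:
    """Time points for a load and a dependent micro-operation that is
    released without waiting for the load's outcome.

    release      -- when the scheduler sends the load (the [0] time point)
    distance     -- stages between the scheduler and the execution unit
    load_latency -- clocks until the data arrives (assumes the best case)
    """
    load_arrives = release + distance                  # [0+6]
    data_ready = load_arrives + load_latency           # [0+6+2]
    dependent_released = release + load_latency        # [0+2]
    dependent_arrives = dependent_released + distance  # [0+2+6]
    return {
        "load_arrives": load_arrives,
        "data_ready": data_ready,
        "dependent_released": dependent_released,
        "dependent_arrives": dependent_arrives,
        # Clocks the execution unit idles waiting for the dependent op.
        "stall": max(0, dependent_arrives - data_ready),
    }


timeline = optimistic_timeline()
for name, value in timeline.items():
    print(f"{name:>18}: {value}")
```

The dependent micro-operation arrives at the same time point at which the data becomes ready, so the stall is zero as long as the optimistic assumption about the load holds.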