Suppose we received the micro-operation at the [0] moment of time. The data from the L2 cache will arrive at the moment of time indicated as [0 + L2 cache access latency]. The Northwood core features an L2 cache access latency of 9 clock cycles in the general case (to be more exact, it equals 7 clocks, but the “data load” command will first check if the data is available in the L1 data cache, which requires 2 additional clocks). So, the scheduler will send out the next micro-operation so that it arrives at the execution unit 9 clocks later.
In fact, this option will hardly work for us, as it takes 9 clock cycles to execute just one micro-operation. We will not accept this scheduler strategy, because it is definitely not the way to high performance.
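The arithmetic behind this option can be put into a few lines of Python. This is only a sketch: the latencies are the ones quoted above, while the function names and the "idle clocks" figure for an L1 hit are illustrative assumptions, not something measured on real hardware.

```python
# Latencies quoted in the text for the Northwood core, in clock cycles.
L1_CHECK_CLOCKS = 2   # the load first checks the L1 data cache
L2_ACCESS_CLOCKS = 7  # the L2 access itself


def worst_case_arrival(load_time: int = 0) -> int:
    """Time at which the dependent micro-operation reaches the execution
    unit if the scheduler always assumes the data comes from the L2 cache."""
    return load_time + L1_CHECK_CLOCKS + L2_ACCESS_CLOCKS


def idle_clocks_on_l1_hit(load_time: int = 0) -> int:
    """Assumption for illustration: if the data was actually in L1, it was
    ready after 2 clocks, yet the dependent micro-operation still arrives
    on the worst-case schedule. The difference is time spent waiting."""
    data_ready = load_time + L1_CHECK_CLOCKS
    return worst_case_arrival(load_time) - data_ready


print(worst_case_arrival())      # 9
print(idle_clocks_on_l1_hit())   # 7
```

In other words, under this assumption every dependent micro-operation pays the full 9 clocks, even when the data could have been used much earlier.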
Second option (upon agreement). The idea is to hold back all micro-operations that depend on the result of the data load command until the data arrives, and only then start sending them on for execution. The good thing about this strategy is that it doesn’t require any additional effort: just sit and wait for the data. The downside is that it doesn’t always guarantee good performance in the long run.
If we had a micro-operation of the second type, the scheduler could take into account information about its execution status from the execution units. In this case the scheduler would need to receive feedback from the execution units about the estimated execution time for the given instruction. In fact, this is quite feasible (it is possible that this strategy is applied to the FPU load); however, there is one unpleasant issue.
Suppose that we were really lucky and the data is available in the L1 cache. On the Northwood core, the data takes two clock cycles to be delivered from the L1 cache.
Say, the execution unit received a micro-operation at the [0] time point. At the [0+2 clocks] point, it receives the data from the L1 cache and immediately reports this to the scheduler, and the latter immediately releases the next micro-operation to the pipeline. This micro-operation will take 6 clock cycles to reach the execution unit.
Everything seems to be correct, but what have we got in the end? Let’s sum up the results: our second micro-operation will reach the execution unit at 0+2+6 clocks, because it still needs to pass all the stages between the scheduler and the execution unit: the distance between them hasn’t got any smaller. It means we need 8 clock cycles in total. It turns out that the dependent instruction reaches the execution unit not when the data is ready, at the [0+2 clocks] time point, but at [0+2+6 clocks], i.e. 6 clock cycles later. In other words, we lost 6 clock cycles!
Well, this is not the best option, I should say.
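The timeline of this feedback-driven strategy can be sketched as a small Python calculation. The two latencies come from the text above. The function name and the dictionary keys are hypothetical and serve only to make the bookkeeping explicit.

```python
# Figures quoted in the text for the Northwood core, in clock cycles.
L1_LATENCY_CLOCKS = 2          # data delivery from the L1 cache
SCHEDULER_TO_EXEC_CLOCKS = 6   # stages between scheduler and execution unit


def feedback_timeline(load_time: int = 0) -> dict:
    """Timeline for the case where the scheduler releases the dependent
    micro-operation only after the execution unit reports that the data
    has arrived (assumed to happen with no extra delay)."""
    data_ready = load_time + L1_LATENCY_CLOCKS
    dependent_released = data_ready  # scheduler reacts immediately
    dependent_arrives = dependent_released + SCHEDULER_TO_EXEC_CLOCKS
    return {
        "data_ready": data_ready,
        "dependent_released": dependent_released,
        "dependent_arrives": dependent_arrives,
        "lost_clocks": dependent_arrives - data_ready,
    }


print(feedback_timeline())
# {'data_ready': 2, 'dependent_released': 2, 'dependent_arrives': 8, 'lost_clocks': 6}
```

The lost clocks equal the scheduler-to-execution-unit distance, so under these assumptions a faster cache would not reduce the loss at all.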