Since the distance between the operations sent back to the replay system stays the same at all times, empty, unoccupied stages sometimes appear between them. The Checker never sends the stop signal to the scheduler for these stages. Let’s call these empty stages “holes”. The scheduler can send out a command for execution if the clock cycle coincides with a hole. This lets the processor use its computational resources more efficiently, because commands from different dependency chains can be mixed together. Let’s take a look at the following example:
LD R1, [X] // loading X into R1 register
ADD R1, R2 //1 – R1 = R1+R2
ADD R1, R2 //2 – R1 = R1+R2
ADD R1, R2 //3 – R1 = R1+R2
ADD R1, R2 //4 – R1 = R1+R2
ADD R1, R2 //5 – R1 = R1+R2
LD R3, [Y] // loading Y into R3 register
ADD R3, R4 //6 – R3 = R3+R4
ADD R3, R4 //7 – R3 = R3+R4
ADD R3, R4 //8 – R3 = R3+R4
We’ve got two chains of dependent commands: one that depends on the R1 register and one that depends on the R3 register. To simplify the example, suppose that all commands of all types are sent to the same scheduler one by one, so the LD R3, [Y] command cannot be scheduled for execution before the fifth ADD R1, R2 command. Let’s assume that a load that hits the L1 cache has a latency of 2 clock cycles and that ADD has a latency of one clock cycle.
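Under these assumptions, the order in which the scheduler first sends the commands out can be sketched with a small toy model. This is only an illustration, not a description of the real scheduler logic: the names `PROGRAM`, `LATENCY` and `first_pass_schedule` are hypothetical, the model dispatches at most one command per clock cycle, and it assumes that the first LD is dispatched at clock cycle 1.

```python
# Toy model: a single in-order scheduler, at most one dispatch per clock cycle.
# Latencies follow the text: LD = 2 cycles (assuming an L1 hit), ADD = 1 cycle.
LATENCY = {"LD": 2, "ADD": 1}

# (name, kind, name of the command it depends on, or None)
PROGRAM = [
    ("LD1", "LD", None),
    ("ADD1", "ADD", "LD1"),
    ("ADD2", "ADD", "ADD1"),
    ("ADD3", "ADD", "ADD2"),
    ("ADD4", "ADD", "ADD3"),
    ("ADD5", "ADD", "ADD4"),
    ("LD2", "LD", None),
    ("ADD6", "ADD", "LD2"),
    ("ADD7", "ADD", "ADD6"),
    ("ADD8", "ADD", "ADD7"),
]


def first_pass_schedule(program, first_cycle=1):
    """Return {name: dispatch cycle}, assuming every load hits L1."""
    dispatch = {}
    kind_of = {}
    next_free = first_cycle  # earliest cycle the scheduler can dispatch again
    for name, kind, producer in program:
        ready = first_cycle
        if producer is not None:
            # The operand is expected after the producer's estimated latency.
            ready = dispatch[producer] + LATENCY[kind_of[producer]]
        cycle = max(next_free, ready)
        dispatch[name] = cycle
        kind_of[name] = kind
        next_free = cycle + 1  # in-order: later commands cannot overtake
    return dispatch


schedule = first_pass_schedule(PROGRAM)
for name, cycle in schedule.items():
    print(f"cycle {cycle:2d}: {name}")
```

In this sketch nothing is dispatched in the cycle right after each LD, because the dependent ADD has to wait for the 2-cycle load latency. That gap is what later shows up as a hole when the chain travels around the replay loop.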
Here are two cases to be considered:
- Both X and Y are in the L1 cache (L1 hit, Pic.4a). No surprises here: no commands go to the replay system, and all commands retire one by one.
- X is not in the L1 cache but is found in the L2 cache (L1 miss), while Y is in the L1 cache (Pic.4b). This is a much more interesting case. Until the 6th clock cycle the scheduler sends commands out for execution according to their estimated latencies. At the 6th clock cycle the first LD command reaches the Check stage on the second pipeline, receives the L1 cache miss signal and is sent to replay (at the same time a read request for X is issued to the L2 cache). The 5 ADD commands that follow will not receive a correct operand and will follow the first LD command into the replay system. By the time the first LD command reaches the replay mux, the scheduler has already sent out the second LD command after ADD5, at the 8th clock cycle; at that moment it also receives the signal to leave one slot free for the first LD command in the next clock cycle. At the 9th clock cycle the first LD command reaches the replay mux and is re-sent. At the 10th clock cycle the replay loop delivers no command, so the scheduler fills the “hole” with the ADD6 command, which is waiting for execution resources. In the following clock cycles the ADD1-ADD5 commands that follow the first LD are re-sent for execution. After the last of them, ADD5, the scheduler is able to fit ADD7 and ADD8 into the available slots. At the 14th clock cycle the first LD command finally receives the data from the L2 cache and executes correctly, so the first LD and the ADD1-ADD5 commands that follow it execute and retire.
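The way the second case interleaves replayed and new commands can be sketched as a simple merge. This is a toy model under stated assumptions, not the actual replay mux: the function `replay_mux` and its arguments are hypothetical, replayed commands always have priority, and the model does not check whether the operands of the waiting commands are ready (in this example they are). The replay stream keeps the original spacing of the chain, so one empty slot sits between the first LD and ADD1. The cycle numbers printed after the 10th clock cycle come from this model, not from Pic.4b.

```python
from collections import deque


def replay_mux(replay_slots, waiting, start_cycle):
    """Merge the replay loop with the scheduler's waiting queue.

    replay_slots: commands coming around the replay loop, one entry per
                  clock cycle, with None marking a hole.
    waiting:      commands the scheduler still holds, in program order.
    """
    queue = deque(waiting)
    timeline = []
    cycle = start_cycle
    for slot in replay_slots:
        if slot is not None:
            # A replayed command always takes its slot.
            timeline.append((cycle, slot, "replay"))
        elif queue:
            # Hole in the replay loop: the scheduler may dispatch into it.
            timeline.append((cycle, queue.popleft(), "scheduler"))
        cycle += 1
    # Once the replay stream has drained, the scheduler dispatches freely.
    while queue:
        timeline.append((cycle, queue.popleft(), "scheduler"))
        cycle += 1
    return timeline


# The first LD re-enters at cycle 9; the hole follows it at cycle 10.
replay_slots = ["LD1", None, "ADD1", "ADD2", "ADD3", "ADD4", "ADD5"]
waiting = ["ADD6", "ADD7", "ADD8"]  # the second LD already left at cycle 8

timeline = replay_mux(replay_slots, waiting, start_cycle=9)
for cycle, name, source in timeline:
    print(f"cycle {cycle:2d}: {name:5s} ({source})")
```

In this model ADD6 lands in the hole at the 10th clock cycle, while ADD7 and ADD8 have to wait until the replayed ADD1-ADD5 have passed, which matches the order described in the text.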
This example shows that the scheduler tries to use the computational resources more efficiently by filling the “hole” between the LD and ADD1 commands with the ADD6 command from the independent chain. Unfortunately, there is another side to this hunt for efficiency. Let’s talk more about it now.