Information

X-bit Labs for mobile users! Do not forget that we are running a special version of X-bit Labs web-site for users of mobile and handheld devices: http://pda.xbitlabs.com. Check out our news and articles from smartphones and PDAs to be always updated on the latest computer and technology news.

 

Articles: CPU

Replay: Unknown Features of the NetBurst Core (page 4)


Category: CPU

by Victor Kartunov , Yury Malich , Jan Keruchenko aka C@t , and Vadim Levchenko aka VLev

[ 06/06/2005 | 04:20 PM ]


Pages : 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17

“Holes”

Since the distance between the operations returned to replay system remains the same all the time, sometimes there appear empty unoccupied staged between them. The Checker never sends the stop-signal to the scheduler for them. Let’s call these empty stages “holes”. The scheduler can send out a command for execution if the clock cycle coincides with a hole. This allows using the processor computational resources most efficiently, as we can mix together commands from different dependency chains. Let’s take a look at the following example:

LD R1, [X] // loading X into R1 register
ADD R1, R2 //1 – R1 = R1+R2
ADD R1, R2 //2 – R1 = R1+R2
ADD R1, R2 //3 – R1 = R1+R2
ADD R1, R2 //4 – R1 = R1+R2
ADD R1, R2 //5 – R1 = R1+R2
LD R3, [Y] // loading Y into R3 register
ADD R3, R4 //6 – R3 = R3+R4
ADD R3, R4 //7 – R3 = R3+R4
ADD R3, R4 //8 – R3 = R3+R4
...

We’ve got two chains of dependent commands: a chain of R1 register dependencies and a chain of R3 register dependencies. To simplify the example suppose that all commands of all types are sent to the same scheduler one by one, that is why the LD R3, [Y] command cannot be scheduled for execution before the fifth ADD R1, R2 command. Let’s take the L1 loading command latency equal to 2 clock cycles, and ADD latency – one clock cycle.

Here are two cases to be considered:


Pic.4a


Pic.4b

  1. X and Y values are in L1 cache (L1 hit, Pic.4a). No surprises here. No commands go to the replay system, all commands retire one by one.
  2. X is not in L1 cache, but it is found in L2 cache (L1 miss). Y is in L1 cache (Pic.4b). This is a much more interesting case. Until the 6th clock cycle the scheduler sends all commands out for execution keeping in mind their estimated latencies. The first LD command reaches the Check stage at 6th clock cycle on the second pipeline, receives the L1 cache miss signal and turned to replay (at the same time X read request for L2 cache is generated). All next 5 ADD commands will not receive a correct operant and will follow the first LD command to the replay system. By the time the first LD command reached the replay mux, the scheduler will have already sent the second LD command after ADD5in 8th clock cycle, and at that moment it also received the signal to leave one slot free for the first LD command in the next clock cycle. At 9th clock cycle the first LD command reaches the replay mux and is resent. At 10th clock cycle replay loop delivers no commands, that is why the scheduler has to fill in the “hole” with another ADD6 command waiting for the execution resources. In the next clock cycles, the ADD1-ADD5 commands following the first LD are resent for re-execution. After the last ADD5 command, the scheduler will be able to fit in ADD7 and ADD8 into the available slots. At 14th clock cycle the first LD command will finally receive the data from L2 cache and will be executed correctly, so the first LD and the following ADD1-ADD5 commands will get executed and will retire.

This example shows that the scheduler tries to use the computational resources more efficiently by using the “hole” between the LD and ADD1 commands. It inserts ADD6 command from the independent succession there. Unfortunately, there is another side to this hunt for efficiency. Let’s talk more about it now.

<<< Previous page Next page >>>

Discussion

Comments currently: 25
Discussion started: 06/08/05
View comments

Add your Comment

Name/Nickname
Your Comments
 

Category News

Category: CPU

Wednesday, July 23, 2008

3:35 pm AMD to Discuss Rival for Intel Atom Towards Year End. AMD’s Competitor for Intel Atom in the Works, Says Company

Monday, July 21, 2008

8:46 am AMD Initiates Pilot Production of 45nm Chips. AMD to Bring 45nm Products in Early Q4 2008

Thursday, July 17, 2008

2:36 pm AMD’s Chief Executive Officer Hector Ruiz Steps Down. Dirk Meyer Becomes New Chief Exec of AMD

12:15 pm Intel: Atom Will Not Substitute Celeron Processors. Intel Denies Possibility to Change Celeron for Atom

Wednesday, July 16, 2008

11:55 pm Intel Promises to Ship 100 Million 45nm Microprocessors This Year. Intel Says 45nm Process Technology Ramp Better than Ever

7:06 pm Intel to Launch Another Offence with Nehalem Microprocessors Later This Year. Intel to Aggressively Push Nehalem Micro-Architecture into High-End Desktops

 
News Archive
All Latest News