
Replay: Unknown Features of the NetBurst Core (page 8)


Category: CPU

by Victor Kartunov, Yury Malich, Jan Keruchenko aka C@t, and Vadim Levchenko aka VLev

[ 06/06/2005 | 04:20 PM ]



Replay at FPU Pipeline

The replay mechanism in the FPU pipeline follows a different algorithm than ALU replay. There appears to be some kind of feedback between the data loading unit and the scheduler: the scheduler sends the dependent instruction further only once the L1 data cache has been checked and the data has been found there. So, if the data is reported missing from the L1 data cache (the case that puts an ALU load into the RL-7 loop), the FP-load (a class that covers x87, MMX, SSE and SSE2 loads) is replayed, but its dependent instructions are not issued. For RL-12 there is no difference in this case: FP operations circle in the replay loop in just the same way. If the data is found in the L1 cache, the latency of an FP-load operation is 9 clock cycles. If the data is not there, n*7 or n*12 clock cycles are added, depending on the situation.

In fact, we did not manage to send any chain of FP operations into RL-7 at all. For example, if an Int-chain is circling in RL-7, the dependent FP-chain ends up in RL-12. The instruction pair “MOVD MM0,EAX – MOVD EAX,MM0”, for instance, moves an Int-chain from RL-7 to RL-12 (through the EAX dependency).
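The latency figures above can be put into a small model. This is only a sketch of the arithmetic stated in the text: the function name and parameters are hypothetical, and n is read here as the number of passes through the replay loop.

```python
L1_HIT_LATENCY = 9  # FP-load latency in clock cycles on an L1 hit (from the text)


def fp_load_latency(replay_passes: int = 0, loop_cycles: int = 7) -> int:
    """Estimate FP-load latency in clock cycles.

    replay_passes -- n, assumed to be the number of trips through the replay loop
    loop_cycles   -- 7 for RL-7 or 12 for RL-12
    """
    if replay_passes < 0:
        raise ValueError("replay_passes must be non-negative")
    if loop_cycles not in (7, 12):
        raise ValueError("loop_cycles must be 7 (RL-7) or 12 (RL-12)")
    # Each replay pass adds one full loop length on top of the hit latency.
    return L1_HIT_LATENCY + replay_passes * loop_cycles


if __name__ == "__main__":
    for n in range(3):
        print(n, fp_load_latency(n, 7), fp_load_latency(n, 12))
```

With no replay the model returns the 9-cycle hit latency. Which loop length applies "depending on the situation" is not modelled here and is left to the caller.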

Why this way and not the other way around? We assume that most instructions going via FP Move actually pass through something like the “Convert & Classify” unit of the K8, where the result is translated into some internal representation (formatting). Two observations support this hypothesis:

  • the latency of inter-register transfers is very high;
  • chains of dissimilar commands operating on the contents of the same SSE register, such as “ADDSD XMM0,XMM0 – ADDSS XMM0,XMM0”, incur significant penalties.

Perhaps most FP Move operations are nothing more than more or less fixed pairs of primitive commands such as “load + convert” or “convert + store”, where the “convert” part takes about 6-7 clock cycles. Returning to replay: in this (hypothetical) case the time required to execute “convert” exceeds the “distance” in clock cycles between the scheduler and the execution unit. So the scheduler can safely wait for the result of the first check before sending the dependent operation further. In case of failure, only the “load + convert” pair needs to be replayed.
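The timing argument can be illustrated with a minimal sketch. It rests on our reading of the hypothesis: the dependent operation is issued at the moment the L1 check result is known, and it can start executing once it has both reached the execution unit and the “convert” step has finished. The function name and the example values for the scheduler-to-execution distance are hypothetical; the text gives only the 6-7 cycle estimate for “convert”.

```python
def cycles_lost_by_waiting(convert_cycles: int, sched_to_exec_cycles: int) -> int:
    """Cycles the dependent op loses if it is issued only after the L1 check.

    convert_cycles       -- assumed duration of the "convert" step after the check
    sched_to_exec_cycles -- assumed distance from the scheduler to the execution unit
    """
    # The dependent op can start no earlier than both its own arrival at the
    # execution unit and the completion of "convert".
    earliest_start = max(convert_cycles, sched_to_exec_cycles)
    # With ideal (speculative) issue it would start right when "convert" completes.
    return earliest_start - convert_cycles


if __name__ == "__main__":
    # Hypothetical distances; only the 7-cycle "convert" figure comes from the text.
    print(cycles_lost_by_waiting(convert_cycles=7, sched_to_exec_cycles=5))
    print(cycles_lost_by_waiting(convert_cycles=3, sched_to_exec_cycles=5))
```

Under these assumptions nothing is lost whenever “convert” takes at least as long as the scheduler-to-execution distance, which is why the dependent operation would not need to be issued speculatively.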


