
Replay: Unknown Features of the NetBurst Core


Category: CPU

by Victor Kartunov, Yury Malich, Jan Keruchenko aka C@t, and Vadim Levchenko aka VLev

[ 06/06/2005 | 04:20 PM ]

In the third part of our NetBurst architecture trilogy, we reveal the details of the replay mechanism implemented in Intel Pentium 4 processors, a mechanism Intel says very little about. The way replay works explains why Pentium 4 processors perform rather slowly despite their high clock frequencies.



Ever since Intel announced the Pentium 4, its strange results in a number of tasks have raised many questions. The Pentium 4 boasted higher clock frequencies and architectural features such as Trace Cache, Rapid Execution Engine, Quad-Pumped Bus, hardware prefetch and even Hyper-Threading, all of which were supposed to increase the number of instructions processed per clock. Even so, it could not outperform its counterparts (Pentium M) or its competitors (AMD Athlon) running at lower frequencies. Most reviewers explained these performance issues by the longer pipeline, and sometimes by the small cache capacity or higher memory latency. Other reasons were suggested only rarely.


However, none of the explanations above really accounts for certain anomalies that show up in testing. As an example, consider measuring memory latency with a chain of dependent instructions mov eax, [eax] (so-called pointer chasing) "with aggravation", where each dependent load is followed by a chain of dependent ADD operations: X * { mov eax,[eax] - N*{add eax, 0} }.
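To make the structure of this test concrete, here is a minimal Python model of the dependency chain. It is only a sketch of the data dependencies, not a latency benchmark: the real test has to be written in assembly so that the `add eax, 0` instructions are actually executed. The names `build_chain` and `chase` are hypothetical and chosen for illustration.

```python
import random


def build_chain(size, seed=0):
    """Return a list where chain[i] holds the index of the next slot.

    The links form one cycle that visits every slot exactly once, so each
    "load" needs the result of the previous one (pointer chasing).
    """
    order = list(range(size))
    random.Random(seed).shuffle(order)
    chain = [0] * size
    for here, there in zip(order, order[1:] + order[:1]):
        chain[here] = there
    return chain


def chase(chain, start, iterations, n_adds):
    """Model X * { mov eax,[eax] - N*{add eax, 0} } on the chain."""
    eax = start
    for _ in range(iterations):
        eax = chain[eax]        # mov eax, [eax]: dependent load
        for _ in range(n_adds):
            eax = eax + 0       # add eax, 0: dependent, value unchanged
    return eax
```

Because adding zero never changes the pointer, the walk visits the same slots for any N; only the length of the dependency chain between two loads changes.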

If we know how long an addition takes, we can determine the time T of the load operation as the time needed to process a single iteration minus the time needed for the chain of N additions. If everything were that simple, the graph of T(N) would be a horizontal line at the ideal L2 cache access time, i.e. 9 cycles (2 + 7). In reality the graph looks as follows, and its shape and behavior simply cannot be explained with the documentation and information in Intel's optimization guides:


Pic. 1: Pentium 4 (Northwood) L2 cache latency testing
with a dependency chain X*{mov eax,[eax] - N*{add eax, 0}}.
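The arithmetic behind the expected flat line can be sketched in a few lines of Python. This is an illustration under stated assumptions, not measured data: the function names are hypothetical, the 9-cycle load time is the ideal L2 figure quoted above, and the add latency of 0.5 cycles is an assumed value passed in as a parameter.

```python
def load_time(iteration_cycles, n_adds, add_latency):
    """T = time of one iteration minus the time of the N dependent adds."""
    return iteration_cycles - n_adds * add_latency


def ideal_iteration_cycles(n_adds, add_latency, load_latency=9.0):
    """Iteration time if the load and the adds simply ran back to back."""
    return load_latency + n_adds * add_latency


if __name__ == "__main__":
    ADD_LATENCY = 0.5  # assumed value, in cycles
    for n in (0, 1, 2, 4, 8, 16):
        cycles = ideal_iteration_cycles(n, ADD_LATENCY)
        print(n, load_time(cycles, n, ADD_LATENCY))
```

In this idealized model T(N) equals the load latency for every N, which is the horizontal line the measured graph fails to follow.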

Luckily, there is at least one hint in the optimization guides: a very brief and superficial description of a mechanism called replay. Here is the quote:

«Replay

In order to maximize performance for the common case, the Intel NetBurst micro-architecture sometimes aggressively schedules µops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, µops must be reissued. This mechanism is called replay.
Some occurrences of replays are caused by cache misses, dependence violations (for example, store forwarding problems), and unforeseen resource constraints. In normal operation, some number of replays are common and unavoidable. An excessive number of replays indicate that there is a performance problem.»

This brief explanation suggests that replay may cause serious problems when a cache miss occurs. In fact, after reading this description it occurred to us that replay could explain the shape of the L2 cache latency graph. Our search for additional information in official documents and articles turned up nothing. All the data we could dig out comes from patents.

The article you are about to read is therefore the result of our detailed study of the following Intel patents:

  • Patent 6,163,838 “Computer processor with a replay system”;
  • Patent 6,094,717 “Computer processor with a replay system having a plurality of checkers”;
  • Patent 6,385,715 “Multi-threading for a processor utilizing a replay queue”.

We also ran and analyzed a large number of benchmarks, paying most attention to the Northwood processor core. A detailed study of the Prescott core is still in progress, as it requires a lot of time and resources.
