Server Platforms Today (page 22)


Category: CPU

by Victor Kartunov

[ 04/29/2004 | 04:49 PM ]



The pipeline is 17 stages long, which is very long for a RISC processor. To keep such a long pipeline from sitting idle, the Power4 uses an advanced branch-prediction system. It is based on three tables, each holding up to 16K entries of branch history. The first table is a traditional branch history buffer that records, for each branch, whether the prediction was successful. The second table (also 16K entries) is a global rather than a local branch table. Each entry in it is associated with an 11-bit history vector that remembers which path was chosen on each of the last eleven instruction fetches from the L1 cache (the fetch unit loads eight instructions at a time from the L1-I) and whether the prediction was correct. This information forms the basis for the next branch prediction.

Let me stress the difference between the two methods. In the first, we track each branch instruction on its own, without regard to the others; in the second, we do exactly the opposite, working with a sequence of outcomes without tracking any particular instruction. Hence the two names for the tables: local and global. There is also a third table that records which of the two methods has been more accurate (caused fewer mispredictions). As a result, the Power4 can switch its branch prediction method within a few hundred CPU cycles.
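The interplay of the three tables can be illustrated with a small simulation. This is only a sketch of the general local/global/selector idea, not IBM's actual logic: the one-bit table entries, the XOR index hash, the class name TournamentPredictor, and the use of taken/not-taken outcomes as the global history are all simplifying assumptions made for the example.

```python
from dataclasses import dataclass, field

TABLE_SIZE = 16 * 1024   # 16K entries per table, as stated in the article
HISTORY_BITS = 11        # length of the global history, as stated in the article


@dataclass
class TournamentPredictor:
    """Toy model: a local table, a global table and a selector table."""
    local: list = field(default_factory=lambda: [False] * TABLE_SIZE)
    global_: list = field(default_factory=lambda: [False] * TABLE_SIZE)
    use_global: list = field(default_factory=lambda: [False] * TABLE_SIZE)
    history: int = 0  # last HISTORY_BITS outcomes, newest in bit 0

    def _local_index(self, address):
        # Local table: indexed by the branch address alone.
        return address % TABLE_SIZE

    def _global_index(self, address):
        # Global table: address mixed with recent history (hypothetical hash).
        return (address ^ self.history) % TABLE_SIZE

    def predict(self, address):
        gi = self._global_index(address)
        if self.use_global[gi]:
            return self.global_[gi]
        return self.local[self._local_index(address)]

    def update(self, address, taken):
        li = self._local_index(address)
        gi = self._global_index(address)
        local_ok = self.local[li] == taken
        global_ok = self.global_[gi] == taken
        # Selector: when exactly one method was right, prefer it next time.
        if local_ok != global_ok:
            self.use_global[gi] = global_ok
        # Both tables remember the latest outcome.
        self.local[li] = taken
        self.global_[gi] = taken
        # Shift the outcome into the 11-bit global history.
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


if __name__ == "__main__":
    predictor = TournamentPredictor()
    hits = 0
    for i in range(200):
        taken = (i % 2 == 0)          # a strictly alternating branch
        hits += predictor.predict(0x40) == taken
        predictor.update(0x40, taken)
    print(f"correct predictions: {hits} of 200")
```

A strictly alternating branch is a case where a one-bit local entry is always one step behind, while the history-indexed table learns the pattern; the selector then moves over to the global table after a short warm-up.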

This Leviathan of a processor has several different system buses: a 32-bit I/O bus (running at one third of the CPU clock rate) and three 128-bit bidirectional buses (running at half the CPU clock rate) that link it to the other processors in the “assemblage”. A 64-bit bus for linking different assemblages crowns the structure. This abundance of buses and the advanced cache hierarchy serve one purpose: keeping the processors busy at all times. Thanks to an appropriate protocol, the processors can access each other’s caches (L2 and L3).
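To get a feel for what these widths and clock ratios mean, here is a rough peak-bandwidth calculation. It is a back-of-the-envelope sketch: the 1.2 GHz CPU clock is a hypothetical value chosen only for round numbers, and the formula assumes one transfer per bus clock in one direction, which the text above does not state.

```python
def peak_bandwidth_bytes_per_s(width_bits, cpu_clock_hz, clock_divisor):
    """Peak bytes per second for a bus of the given width that runs at
    cpu_clock_hz / clock_divisor, assuming one transfer per bus clock."""
    return width_bits / 8 * cpu_clock_hz / clock_divisor


# Hypothetical CPU clock, used only for illustration.
CPU_CLOCK_HZ = 1_200_000_000

# Widths and clock ratios as given in the article.
buses = {
    "32-bit I/O bus (1/3 CPU clock)": (32, 3),
    "128-bit inter-processor bus (1/2 CPU clock)": (128, 2),
}

for name, (width_bits, divisor) in buses.items():
    rate = peak_bandwidth_bytes_per_s(width_bits, CPU_CLOCK_HZ, divisor)
    print(f"{name}: {rate / 1e9:.1f} GB/s")
```

With three such 128-bit links per processor, the inter-processor paths dwarf the I/O bus, which fits the stated goal of keeping the processors fed.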

Let me explain what I mean by the word “assemblage”. IBM surprised us on the manufacturing side too: it managed to put four processors into a single multi-chip module, together with all their buses and 128MB of L3 cache. That is a manufacturing achievement in itself: the area of the assemblage is 13,225 sq. mm! By the way, each of the four processors links to the others through a “point-to-point” bus.

This technological miracle is of course priced accordingly, at about $10,000. However, that is not a high price for a processor of this class.

The topology of systems built on this curious processor is also off the beaten track. IBM calls it Distributed Switch. Such a topology has no clear center. In fact, the links between the processor assemblages are closed into two parallel rings. Thus each processor can be reached by several routes, which eliminates bus congestion. The maximum number of assemblages is 4, for 32 processors in a system (each Power4 chip carries two cores). The high efficiency of this organization allows such a system to perform as fast as 64- to 128-way systems from other manufacturers.

So all that remains is to see how the Power4 does in SPEC CPU 2000. Note, though, that this test measures a single processor and does not reflect the nuances of the system organization. In other words, overall performance in real applications will be higher thanks to the remarkable system organization.

So, this processor scores 1077 points in SPECint_base2000. This is an average result overall, yet it is higher than any other RISC processor has scored (save for the Itanium, which is not a pure RISC).

The result is better in SPECfp_base2000: 1598 points. In fact, only the Itanium managed to outperform it. The Power4 is good at this kind of test.

Once again, this test cannot capitalize on the main advantages of Power4-based systems. In real applications (and in real systems), the Power4 is the world’s fastest CPU in terms of transactions per processor.

