Server Platforms Today (page 21)


Category: CPU

by Victor Kartunov

[ 04/29/2004 | 04:49 PM ]


IBM Power4

You feel embarrassed applying a name like “microprocessor” to the IBM Power4. The module is monstrous: an assemblage of four processors with an L3 cache forms a square of 115x115mm, which makes an area of 13225 square millimeters! There is nothing “micro” about this microprocessor.

Well, if someone makes processors of that size, someone certainly needs them. Let’s see what it has inside. First of all, the Power4 contains two processor cores. You can see them in the following figure:

You see that the internal structure of the processor is nontrivial. Two processor cores are linked with a special high-speed switch. In fact, we have an SMP system within one CPU – the cores are joined with a bus that works at 500MHz!

Other subsystems are impressive, too: the L2 cache has three independent cache controllers and three banks (you can see them in the figure) with a total capacity of 1536KB, and it delivers a bandwidth of over 100GB/s when working at 1.7GHz (the frequency of the flagship Power4+ model).

The processor core is curious in itself. First of all, the IBM Power4 decodes the external instruction set into internal microinstructions, like x86 CPUs do. The reasons for this solution are obvious: too much software has been written for the previous CPU generations, and that software costs more than the hardware. IBM simply couldn’t abandon that software baggage. Thus, the same problems met the same solutions.

The micro-architecture is designed to execute up to eight instructions per cycle, which is an impressive degree of parallel execution.

Let’s now see what a single core looks like. The decoder translates external instructions into a set of elementary operations (ops) that are then packed into groups. One instruction is usually unfolded into two or three ops. A group has five slots: the first four can be filled freely, while the fifth is always reserved for a branch instruction. The ops go for execution in such groups, moving along the pipeline together.
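The grouping rule described above can be illustrated with a toy model. This is only a sketch under simple assumptions, not IBM's actual dispatch logic: the function name `form_groups`, the op names, and the rule that a branch closes its group are hypothetical choices made for illustration.

```python
GROUP_SLOTS = 5   # slots per group, per the text
BRANCH_SLOT = 4   # index of the fifth slot, reserved for a branch


def form_groups(ops):
    """Pack a stream of (name, is_branch) ops into five-slot groups.

    Non-branch ops fill slots 0-3 in program order. A branch always goes
    into slot 4 and (by assumption) closes the group. Unused slots stay None.
    """
    groups = []
    current = [None] * GROUP_SLOTS
    filled = 0  # number of regular slots used in the current group

    for name, is_branch in ops:
        if is_branch:
            current[BRANCH_SLOT] = name
            groups.append(current)
            current = [None] * GROUP_SLOTS
            filled = 0
        else:
            if filled == BRANCH_SLOT:  # slots 0-3 are already full
                groups.append(current)
                current = [None] * GROUP_SLOTS
                filled = 0
            current[filled] = name
            filled += 1

    if filled:  # flush a partially filled last group
        groups.append(current)
    return groups


# Hypothetical op stream: three ops, a branch, then five more ops.
example = [("add", False), ("load", False), ("cmp", False), ("bc", True),
           ("mul", False), ("store", False), ("add2", False), ("sub", False),
           ("xor", False)]
groups = form_groups(example)
for g in groups:
    print(g)
```

In this model a group may leave slots empty, either because a branch arrived early or because no branch was available for the fifth slot.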

Each core has two ALUs, two FPUs (with slightly different functions; for example, division is only performed by FPU2), two load/store units, a branch unit and a condition-register unit. Overall, we have eight functional blocks. Out-of-order execution is supported: the Group Completion Table (an analog of the Reorder Buffer in Xeon processors) can hold up to 20 groups of elementary operations (i.e. about 100 ops), which are sent to the execution units as they become ready. Overall, the processor can have as many as 215 instructions at various execution stages at any given moment.
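To make the role of the Group Completion Table more tangible, here is a minimal sketch of a bounded table of groups. It assumes, by analogy with a reorder buffer, that ops may finish out of order while groups retire strictly oldest-first; the class and method names are hypothetical and the model ignores everything else about the real pipeline.

```python
from collections import deque

GCT_ENTRIES = 20      # groups the table can hold, per the text
SLOTS_PER_GROUP = 5   # ops per group, per the text


class GroupCompletionTable:
    """Toy model: a bounded FIFO of groups that retire in program order."""

    def __init__(self, entries=GCT_ENTRIES):
        self.entries = entries
        self.groups = deque()  # each item: [group_id, ops_still_running]

    def dispatch(self, group_id, n_ops):
        """Allocate an entry; return False (a stall) when the table is full."""
        if len(self.groups) >= self.entries:
            return False
        self.groups.append([group_id, n_ops])
        return True

    def finish_op(self, group_id):
        """Mark one op of a group as executed (ops may finish out of order)."""
        for entry in self.groups:
            if entry[0] == group_id:
                entry[1] -= 1
                return
        raise KeyError(group_id)

    def retire(self):
        """Retire finished groups from the head only, oldest first."""
        retired = []
        while self.groups and self.groups[0][1] == 0:
            retired.append(self.groups.popleft()[0])
        return retired


max_ops_tracked = GCT_ENTRIES * SLOTS_PER_GROUP  # 20 * 5

gct = GroupCompletionTable()
accepted = [gct.dispatch(g, 1) for g in range(21)]  # the 21st dispatch stalls
gct.finish_op(1)      # a younger group finishes first
early = gct.retire()  # nothing retires: group 0 is still running
gct.finish_op(0)
late = gct.retire()   # now groups 0 and 1 retire together
```

The product of 20 entries and five slots gives the "about 100 ops" mentioned in the text; the larger figure of 215 instructions also counts instructions in pipeline stages outside the table.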

Besides that, the core can launch a combined multiply-add operation each cycle on each FPU. This operation occurs often in various programs. Counting the multiplication and the addition separately, we get four floating-point operations per cycle, which is an absolute record among all processors (well, nearly every characteristic of the IBM Power4 CPU aspires to be record-breaking). It’s also possible to launch two floating-point additions or multiplications at a time, which no other micro-architecture allows.
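The "four FPU operations per cycle" figure translates into a theoretical peak rate with simple arithmetic. The sketch below assumes that each combined operation counts as two floating-point operations and uses the 1.7GHz clock quoted earlier for the flagship Power4+; real sustained throughput would be lower.

```python
FPUS_PER_CORE = 2          # from the text
FLOPS_PER_COMBINED_OP = 2  # assumption: the combined op counts as two operations
CORES_PER_CHIP = 2         # from the text
CLOCK_GHZ = 1.7            # flagship Power4+ clock quoted in the article

flops_per_cycle_per_core = FPUS_PER_CORE * FLOPS_PER_COMBINED_OP
peak_gflops_per_core = flops_per_cycle_per_core * CLOCK_GHZ
peak_gflops_per_chip = peak_gflops_per_core * CORES_PER_CHIP

print(f"{flops_per_cycle_per_core} FP operations per cycle per core")
print(f"{peak_gflops_per_core:.1f} GFLOPS peak per core")
print(f"{peak_gflops_per_chip:.1f} GFLOPS peak per two-core chip")
```

Under these assumptions the peak comes to 6.8 GFLOPS per core and 13.6 GFLOPS per two-core processor.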

The cache subsystem tries to match this record-setting trend. Each core has 32KB of two-way set-associative data cache (with an access latency of only 1 cycle!) and 64KB of direct-mapped instruction cache. Each cache uses 128-byte lines; a data cache line is organized as four 32-byte sectors, which can be accessed independently (it’s possible to write into one sector and read from two others without conflicts). The instruction cache can write or read 32 bytes each cycle. The L2 cache is eight-way set-associative, with 128-byte lines and a size of 1536KB. Each processor also contains an L3 cache controller. The amount of L3 cache memory can be up to 32MB per processor (that is, per two cores). The processor also has a memory controller with a bandwidth of 11GB/s per processor. The maximum amount of memory supported by each processor is 16GB.
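The cache figures above determine how many lines and sets each cache has and how an address maps onto them. The following sketch assumes the data cache is two-way set-associative and the L2 is eight-way set-associative, and it uses plain modulo indexing; the real L2 is split across three controllers, so its actual indexing scheme is certainly different. The function names and the sample address are hypothetical.

```python
KB = 1024
LINE_BYTES = 128    # line size, per the text
SECTOR_BYTES = 32   # data cache sector size, per the text


def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number of lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets


def split_address(addr, line_bytes, sets):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


l1d_lines, l1d_sets = cache_geometry(32 * KB, LINE_BYTES, ways=2)
l2_lines, l2_sets = cache_geometry(1536 * KB, LINE_BYTES, ways=8)

addr = 0x12345  # arbitrary example address
tag, index, offset = split_address(addr, LINE_BYTES, l1d_sets)
sector = offset // SECTOR_BYTES  # which of the four 32-byte sectors is hit

print(f"L1 data cache: {l1d_lines} lines in {l1d_sets} sets")
print(f"L2 cache: {l2_lines} lines in {l2_sets} sets")
print(f"address {addr:#x}: tag={tag}, set={index}, offset={offset}, sector={sector}")
```

With these assumptions the data cache has 256 lines in 128 sets, and the L2 cache has 12288 lines in 1536 sets.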
