AMD's Next Generation Microarchitecture Preview: from K8 to K8L

New Intel Core 2 Duo processors took the lead in the processor market these days and AMD has nothing to respond with at this time. However, next year AMD promises to release a Conroe-killer, the CPU based on new K8L microarchitecture. Our newest article is devoted to the details of AMD K8L microarchitecture and how it compares with Core and K8 from the theoretical prospective.

by Yury Malich
08/22/2006 | 07:36 PM

July 27, 2006, Intel officially introduced its new Core 2 processor to the public. Based on the Conroe core, it proved to be a breakthrough in terms of CPU performance. AMD just doesn’t have a processor that could match the newest top-end models of the Core 2 series.

 

Unfortunately for AMD, the company was overconfident with its highly successful K8 series and didn’t prepare a timely response. According to the recently published roadmaps, AMD will release a new processor currently known under its codename of K8L (other sources refer to it as the K10) in the middle of the next year, i.e. a full year after Intel’s release of the Core 2.

The K8L is being originally designed as a quad-core chip. Its four execution cores will reside on a single silicon wafer and will all be connected to a shared L3 cache, to a common crossbar and to a common memory controller.

The already known and expected micro-architectural improvements in the upcoming processor will be discussed in this article. We will compare it with the Conroe micro-architecture to evaluate its possible fortes and weaknesses as well as future perspectives.


Fig. 1

Instruction Fetching

Each clock cycle the K8 processor is fetching instructions in aligned 16-byte blocks from the L1I instruction cache into a buffer where the instructions are extracted from the block and are then sent to the decoder’s channels. The 16 bytes per cycle fetch rate allows sending 3 instructions with an average length of up to 5 bytes for decoding on each cycle. In certain algorithms, however, the average instruction length in a chain may be bigger than 5 bytes.

Particularly, the length of a simple SSE2 instruction with register-register operands (for example, MOVAPD XMM0, XMM1) is 4 bytes. If an instruction uses indirect addressing (with a base register and an offset like in MOVAPD XMM0, [EAX+16]), its length increases to 6-8 bytes. In 64-bit mode a one-byte REX prefix is added to the instruction code when the additional registers are employed. Thus, SSE2 instructions may be as long as 7-9 bytes in 64-bit mode. An SSE1 instruction may be 1 byte shorter if it is a vector one (that is, operates on four 32-bit values), but is 7-9 bytes, too, under the same conditions if it is a scalar one (with one operand).

In this situation, the fetch rate of 16 bytes per cycle doesn’t seem fast enough to keep up the decoding speed at a rate of 3 instructions per cycle. This limitation is not important for the K8 processor because vector SSE and SSE2 instructions are decoded at a rate of 3 instructions per 2 clock cycles (or 1.5 instructions per cycle), which is enough to load the two 64-bit FPUs. In the future processor, a rate of at least 3 instructions per cycle must be maintained. Considering this, the fetching in 32-byte blocks, announced in the presentation of architectural innovations of the K8L, doesn’t seem excessive. If the succession of these long commands takes a few neighboring 16-byte blocks, then the average fetching tempo with 16-byte data blocks of 3 commands per clock cannot be achieved.


Fig. 2

Figure 2 illustrates the positioning of five long instructions in a 32-byte block which can be fetched in one clock cycle. If the instructions are fetched in 16-byte blocks, it is impossible to achieve a fetch rate of 3 instructions per cycle.

By the way, Conroe processors fetch instructions in 16-byte blocks, just like K8 processors do, so they can decode the instruction stream at a rate of 4 instructions per clock only when the average instruction length is no longer than 4 bytes. Otherwise the decoder cannot process not only 4 but even 3 instructions per clock. To fight this in short loops, the Conroe has a special 64-byte internal buffer that caches loops up to 64 bytes long (four 16-byte blocks) and allows fetching data in such loops at a rate of 32 bytes per cycle. If a loop is longer than 4 blocks, it cannot be cached in this buffer.

The fetching of the next block of instructions is done using the branch prediction mechanism if there are any branch instructions present. Branches are predicted in the K8 processor by means of simpler algorithms than those employed in the Conroe. For example, the K8 cannot predict alternating indirect branches (this may have a negative effect on the execution of object-oriented polymorph code) and is also doesn’t always predict correctly regular patterns. The branch prediction mechanism will be improved in the K8L, but there’s no detailed info about that yet. The branch tables and counters will probably be made larger, and the algorithm of predicting branches alternating in regular patterns may be improved.

Decoding

The x86 instructions extracted from the block of bytes are decoded into macro-operations. A macro-op consists of two micro-operations: an integer or floating-point arithmetic micro-op and an address operation for memory access. The splitting into micro-ops is done by the scheduler prior to sending them for execution. The decoder of K8 processors distinguishes between three types of instructions:

In a K8 processor, DirectPath and VectorPath instructions cannot be dispatched simultaneously. The decoders are issuing the decoded results at a rate of 3 macro-ops per cycle. Thus, the hardware decoder can decode 3 single instructions, 1 double and 1 single instruction or 1.5 double instructions (3 double instructions per two cycles). Since one VectorPath instruction can be decoded into more than 3 macro-ops, it can take more than 1 cycle to decode such instructions.

The macro-ops produced by the decoder each clock cycle are united into groups. A group consisting of 2 or even 1 macro-op is possible due to alternation of DirectPath and VectorPath commands and to various instruction fetch latencies. Such a group is completed with empty macro-ops so that there are thee macro-ops in total and is then dispatched.

VectorPath instructions from the SSE, SSE2 and SSE3 sets are divided in the K8 processor into pairs of macro-ops that separately process the top and bottom 64-bit parts of a 128-bit SSE register on 64-bit execution units. That’s why such instructions are decoded in the K8 processor at a rate of 3 instructions per 2 clock cycles. The width of the SSE devices in the future K8L processor will be expanded to 128 bits, so there is now no need to split vector instructions in two parts. The algorithm of decoding such instructions will obviously be changed in such a way that vector instructions could be decoded into single 128-bit macro-ops at a rate of 3 instructions per cycle.

Although the decoder of the K8L processor may not be able to decode 4-5 instructions per cycle, just the way Conroe can do it under favorable conditions, it will not hinder programs execution, because the commands are on average executed at less than 3 commands per cycle. K8 usually decodes one x86 instruction into fewer macro-operations than Conroe CPU would do. This, as well as the 32-byte fetch set, make its decoder highly efficient.

Integer Instructions

The decoded triplets of macro-ops arrive at the instruction control unit (ICU), which puts information about them into the reorder buffer (ROB), and are then transferred to the schedulers. The ROB keeps track of the state of the macro-ops and controls the order of their retirement. The macro-ops come into the queues and are retired in groups of three (in lines) in the same manner as they have arrived at the ICU, but are received by the schedulers and are dispatched to the execution units independently.

The macro-ops from each group are distributed among the three independent queues of the scheduler, 8 elements each (24 macro-ops in total), assigned to three symmetrical integer channels. The queue number corresponds to the position of the macro-op in the group as it was shaped on the decoder’s output. As soon as the data is ready, the scheduler can dispatch one integer operation to the ALU and one address operation to the AGU from each queue. There can be two simultaneous memory accesses at most. Thus, each clock cycle 3 integer operations and 2 memory operations (64-bit reads and writes in any combination) can be dispatched. Integer operations are dispatched from the queues out of order as soon as the data is ready for them. But the load from memory operation is performed in program order, for example:

add ebx, ecx       ;
mov eax, [ebx+10h] ; quick address calculation
mov ecx, [eax+ebx] ; the address depends on the result of the previous instruction
mov edx, [ebx+24h] ; this instruction won’t be executed until the addresses of all the
                              ; previous instructions are calculated

This is one of the limiting factors in the K8 processor. It is because of it that the K8, although can dispatch two read instructions per clock, may be less efficient with memory than the Conroe, which can only dispatch one read instruction per clock but has a mechanism for speculative out-of-order execution of read instructions bypassing previous reads and writes (called Memory Disambiguation). Fortunately, a mechanism for out-of-order loads will appear in the K8L, and this bottleneck will be eliminated. Details of this mechanism are not yet disclosed, but the reordering of read instructions will not probably affect write instructions, and this may become a reason for less efficient execution of some types of code.

A group of three macro-ops is removed from the ROB after all the instructions from this group are executed. The queuing and removal of macro-ops in groups simplifies control over the resources and helps load the schedulers in a more efficient way. If one of the three queues is fully loaded, new triplets of macro-ops cannot arrive at the scheduler and empty slots may appear in the other queues. However, there is a small percent of free slots in practice, and they do not worsen the CPU efficiency much.

Besides that, there can theoretically occur a certain reduction in the scheduler efficiency due to the static linking of the position of a macro-op in the group to the scheduler’s queue because one queue can have two or more micro-ops ready for execution, and another can have none (Figure 3). But this is not very probable in practice and is not frequently observed since there are usually quite enough of execution-ready instructions in the pipeline.


Fig. 3

Unlike in the K8, there is a common queue for all instructions, including floating-point ones, in the Conroe. The queue length is 32 macro-ops. The common queue theoretically helps avoid empty slots and the possible limitations due to the static linking to the execution units. Besides that, the stack engine mechanism helps reduce the number of data dependencies between the instructions PUSH, POP, CALL and RET. In practice, however, it is very difficult to organize a fully associative queue from which all 5 micro-ops could be dispatched simultaneously. That’s why the common queue is still divided into sections. This produces a reverse effect: the insufficiently ordered selection of execution-ready instructions leads to the so-called chaotic scheduling problem of P6+ family processors and reduces the execution speed. Besides that, the Conroe has a limitation on the number of registers that can be simultaneously read from the ROB (not more than three), which puts a limitation on the scheduling of the instruction stream.

The out-of-order execution mechanisms in the K8L and the Conroe differ but slightly, because both processors can send up to 5 commands per clock cycle to be executed (3 ALU + 2 Mem). The peculiarities of scheduling and execution algorithms may show differently depending on the code generated by the compiler.

Floating Point Instructions

In the K8 processor the scheduler of floating-point instructions is separate from the integer instruction scheduler and is designed in a different way. The scheduler buffer can take in up to 12 groups of three macro-ops (theoretically, 36 floating-point operations). Moreover, a queue consolidation mechanism is employed to make one complete and one empty triplet out of two incomplete triplets. Unlike the integer instruction execution unit with symmetrical computational channels, the FPU contains three different units FADD, FMUL and FMISC (it is FSTORE, too) for floating-point addition, multiplication and auxiliary operations, so the scheduler buffer doesn’t link the position of a macro-op in an instruction group to a particular execution unit (Figure 4).


Fig. 4

Each clock cycle one operation can be dispatched to one of the K8’s 80-bit FPUs. 128-bit SSE instructions are divided into two 64-bit macro-ops on the decoding step. These macro-ops are then dispatched sequentially in two cycles. Theoretically, up to three macro-ops can be dispatched each cycle, but this rate is unachievable in practice due to the decoding limitations because besides floating-point instructions there are also auxiliary commands of loads, loops, etc. in the code. Moreover, the simple scheduling algorithm doesn’t always distribute the operations in the free devices in the optimal order, and this can reduce the dispatch rate due to inefficient utilization of the execution devices.

Thanks to its two 64-bit read buses, the K8 processor can receive up to two 64-bit operands per cycle from the L1 cache, which helps the processor keep up a high execution rate when floating-point instructions are frequently accessing data in memory. This is an important feature of the architecture since four operands are necessary to execute two instructions in parallel (two operands per one instruction), and two out of four operands are usually read from memory in a number of algorithms for processing streaming data.

In the K8L processor the FADD and FMUL devices will be expanded to 128 bits (Figure 5), which will help double the theoretical floating-point performance with code that uses vector SSE instructions (not only due to a doubled dispatch rate, but also due to an increased decoding and retiring rate caused by the reduced number of generated macro-ops).


Fig. 5

The buses for reading data from the cache will also become two times wider, which will enable the processor to perform two 128-bit data loads from the L1 cache per cycle. The ability to perform two 128-bit data reads per cycle can give the K8L an advantage in some algorithms over a Conroe-core processor that is only capable of performing one 128-bit load.

According to the revealed information, the FMISC (FSTORE) device will remain a 64-bit one. This is illogical since writes from the 128-bit SSE registers into memory are executed on the FMISC (FSTORE) unit, and if this unit is left as it is now, it will automatically become a bottleneck in streaming calculations due to the inability of the CPU to perform a 128-bit data write each clock cycle. So, there is an opinion that AMD’s presentation contains an error and the FMISC (FSTORE) unit will still be able to perform writes at the full rate, i.e. 128 bits per cycle. Or perhaps the saving of 128-bit data will be implemented in the other two units while the auxiliary operations unit will work at half the full rate, which won’t be crucial for overall performance. The scheduling algorithm, which doesn’t always work optimally in the K8, calls for improvement, too.

As we said above, the Conroe, unlike the K8, uses a common queue for integer and floating-point instructions with all the ensuing advantages and shortcomings. Besides that, the ports for dispatching integer and floating-point instructions are combined, which prohibits certain combinations of integer and floating-point instructions. Another limitation in the Conroe, which can show up in floating-point algorithms that make use of x87 commands (i.e. without SSE optimizations), is the two times lower rate of dispatching the multiplication instruction FMUL.

Besides having wider execution units, the K8L will also have wider integer units inside the FADD and FMUL blocks that deal with SSE2 commands processing. As a result, the integer applications using these instruction sets will work faster. Also K8L will learn to perform a few extra SSE instructions that we won’t discuss here.

The FPU in the K8L is going to be very effective, more effective in some parameters (like the ability to read two 128-bit values per cycle) than the Conroe’s.

Memory Subsystem

The K8 has a separated L1 cache: 64KB for instructions and data each. Each cache is 2-way set associative; the line length is 64 bytes. This low associativity is due to the ability of the core to perform two reads from the cache per clock cycle. It is compensated by the rather large size (64KB); in inconvenient algorithms the effective size of the cache will be close in efficiency to a 32KB cache with an associativity of 4. In algorithms with sequential access a large L1 cache that allows performing two reads per cycle looks preferable.

The L2 cache (paired with the L1 cache) is exclusive: the data in the L1 and L2 caches are not duplicated. The L1 and L2 caches exchange data across two unidirectional buses (one goes from the L1 to the L2 and one goes from the L2 to the L1), each 64 bits or 8 bytes wide (Figure 6). With this organization, the processor receives data from the L2 cache at a rather slow rate of 8 bytes per clock (8 clocks to transfer a 64-byte line). As a result, the data transfer latency is high, especially when two or more lines in the L2 cache are being accessed simultaneously. The latency is somewhat compensated by the increased number of cache hits due the high associativity of the L2 cache, which is 16, and due to the larger total amount of cache memory (thanks to the exclusive design).


Fig. 6

A memory controller resides on the same silicon wafer with the CPU core. Data flows from the memory controller through the crossbar and directly into the L1 cache, bypassing the L2 cache. This reduces the latency of data transfers from system RAM. The L2 cache only receives those data that have been pushed out of the L1.

As was mentioned above, the K8L processor’s L1 cache can provide up to two 128-bit reads or one read and one write per clock. Unfortunately, there is no info that the width or design of the bus that connects the L1 and L2 caches will be changed. We hope, however, that it will be at least broadened twofold. Otherwise, the slow inter-cache bus is going to limit the CPU performance when running code with streaming floating-point instructions: the powerful computational resources of the CPU will be just idle, waiting for data that are being read from the L2 at a slow rate. The associativity of the L1 cache won’t be increased, either, so it looks like we can’t hope for any performance miracles from the new processor in this respect.

The K8L will have a shared (among 4 or fewer cores) L3 cache. This cache will reside on the same wafer with the cores and will have a capacity of 2 or more megabytes. The L3 cache will most likely be exclusive, like the L2 cache. Combined with an enhanced crossbar, the L3 cache will solve the problem of low speed of transfers of modified data between the caches of neighboring cores which are performed via the memory bus in the K8 (for details see our article called X-bit's Investigation: Data Transfer Rate between the Cores in Dual-Core Processors). This problem is largely solved in the Conroe by sharing the L2 cache between the two cores. Thus, the quad-core K8L will most likely be close to the Conroe core in its inter-core data transfer characteristics. What’s curious, it is now reported that the future quad-core Intel processor won’t have its cache shared among the four cores. So, the same problems as we see now in a dual-core processor without a shared cache (an insufficiently high speed of data transfers necessary to maintain cache coherency) may appear in it, too.

Conroe-core processors have a well-developed caching system: a 32KB L1 cache with an associativity of 8 and a 2-4MB L2 cache with an associativity of 16 are linked with a full-speed 256-bit bus. The processor has highly efficient prefetch units that can aggressively load data not only from system RAM but also from the L2 cache into the L1 cache. It is the enhanced caching system that is largely responsible for the greatly increased results of the Conroe in SPEC INT (by 40% in some subtests) in comparison with the previous Yonah core.

To all appearances, there won’t be data prefetching from the L2 cache into the L1 cache and there won’t be high-efficiency prefetch units in the K8L due to the peculiarities of its caching system. If the mentioned deficiencies of the caching subsystem of the K8 core are not eliminated in the K8L, the new processor won’t be able to show its full potential and will be less efficient than the Conroe.

Besides the mentioned innovations, the K8L’s memory subsystem is going to be modernized in some other ways like support for upcoming DDR3 SDRAM and FB-DIMM, and for Hyper-Transport 3. But these improvements will hardly have a great impact on the performance of desktop computers.

Conclusion

The future processor from AMD, currently known as K8L, will be the next step in the evolution of the K8 series. The improvements we learned about so far such as increased to 32 bytes instruction sample, improved branch prediction algorithm, introduction of out-of-order reading, will eliminate a few bottlenecks and improve the integer performance. The expansion of SSE instruction sets will help improve the performance significantly in applications dealing with heavy floating-point or integer calculations using SSE instructions, where K8L will be able not only to compete successfully but even to outperform Conroe. The inability to decode and retire 4 commands per clock cycle in some cases may result into tangible performance gaps in integer applications. However, it may not be of that much importance in most cases, because the typical instruction execution pace in real integer applications does not exceed 2-2.5 instructions per cycle because of the data dependence. K8L performance may also be increased in other ways without raising the number of decoding and retirement pipes. Here I am talking about increasing the scheduler queue depth, reducing the number of false dependences (thanks to mechanisms similar to stack engine) and performing out-of-order reading skipping the writes. We do not know today if AMD will be able to implement these features in their new processor. We also do not know if they will modify the caching and prefetch system. However, this is very important for successful competition of the new K8L against Conroe processor in the entire range of tasks.

The successors to K8 architecture require a lot of work. They need not just to compete with Intel solutions, but to outperform them, and that is why their architecture needs to be free from the K8 bottlenecks and has to be able to decode and process up to four instructions per clock. Alas, developing and perfecting a new core is a tremendously laborious and time-consuming task. Even if a new architecture is under development at AMD, it will hardly become market-ready in the next couple of years. It’s hard to tell what this rapidly developing industry will have become like by that time!

The author would like to thank the following people:

Yan Keruchenko aka C@t - for constructive criticism and help with theoretical analysis
Sergey Zagursky aka McZag and Sergey Romanov aka GReY - for constructive criticism, valuable hints and suggestions
Maria Desyatnikova - for editorial help