More Instructions per Clock?
Obviously, increasing the number of cores is not an ultimate panacea. That became clear back when AMD launched their six-core Phenom II X6 processors, which were inferior in performance to quad-core Sandy Bridge CPUs. Therefore, AMD developers didn’t stop at extensive design modifications. The basic Bulldozer microarchitecture has changed almost completely compared with K10, so there is hope that AMD systems will speed up not only in multi-threaded tasks, but also in less-parallel applications. And these hopes are backed by some very objective evidence. While previous AMD microarchitectures were designed to process up to three instructions per clock on a single core, the Bulldozer microarchitecture should be capable of processing four instructions per clock, which brings it very close to the competing processors on Core microarchitecture.
We can see some qualitative changes at the very first stages of the execution pipeline: instruction prefetch and decoding. These stages are shared between the pair of cores within a single module, so AMD made sure they didn’t turn into an architectural bottleneck. Instructions to be decoded are fetched from the L1I cache in 32-byte blocks, twice as much as in second-generation Core based processors. The L1I instruction cache itself is 64 KB in size and has two-way associativity; instructions are preloaded into it speculatively from the L2 cache.
Instructions are prefetched under the control of the branch prediction unit, which contains two sets of buffers that independently monitor the activity of the two cores. This way Bulldozer doesn’t “get lost” among the threads during branch prediction. Since the new microarchitecture is designed to work at high clock frequencies, the quality of branch prediction is extremely important. Therefore, AMD have completely reworked the branch prediction algorithms and hope that prediction accuracy in the new Bulldozer will improve substantially.
Bulldozer’s x86 instruction decoder is also shared between the two cores and is capable of decoding up to four incoming instructions per clock cycle. However, its output is limited to four macro-instructions (the product of the decoding process, in AMD’s terminology), while a single x86 instruction may break down into one, two, or even more macro-instructions. So even though the decoder has become 33% more capable than in the previous-generation microarchitecture, its throughput may not be high enough to optimally load two integer clusters and one floating-point cluster.
AMD also employed a kind of macro-fusion in the new Bulldozer: certain groups of x86 instructions can be joined together and pass through the decoder as a single instruction. AMD calls this Branch Fusion.
Decoded macro-instructions are then distributed to three computational clusters, two of which are what remains of the fully-fledged computational cores, while the third, the floating-point cluster, is shared between the cores. Each of these clusters has its own instruction-reordering logic and its own scheduler. This obviously means that AMD can eventually replace or modify some of these clusters entirely in their future products.
Instruction reordering in each cluster is based on a physical register file. This file stores references to register contents and eliminates the need to constantly move data around inside the processor when the instruction order changes. This approach replaced the reorder buffer, because a physical register file is more energy-efficient and more tolerant of processor clock frequency increases.
Each integer cluster contains two arithmetic execution units (ALU) and two address generation units (AGU) for memory operations. That is one ALU and one AGU fewer than in the K10 microarchitecture, but according to AMD it shouldn’t severely affect performance, while allowing a significant reduction in die size. I can easily believe that: it doesn’t make much practical sense to have more than two ALUs and two AGUs per integer cluster, because the decoder can send no more than four macro-instructions per clock for execution by both clusters.
At the same time, the execution units have become more universal and barely differ in functionality.
The organization of the cache-memory subsystem has changed dramatically. The L1D cache was reduced from 64 KB to 16 KB and made inclusive with write-through. At the same time its associativity was increased to 4-way and a “way predictor” was added. To make up for the serious reduction in size, the bandwidth of the L1 data cache was increased quite substantially: it can now process up to three 128-bit operations at the same time, two reads and one write.
The changes in L1D cache bandwidth are obviously connected with the need to implement 256-bit AVX instructions, which are supported in the shared FPU. However, this doesn’t mean that the floating-point units have become 256-bit wide. In reality there are four 128-bit units in a single Bulldozer module, and AVX instructions are decoded into linked pairs of 128-bit operations. The floating-point multiply-accumulate (FMAC) units team up to execute them, so the throughput of the floating-point cluster drops to one AVX instruction per processor module per clock cycle.
The FPU doesn’t have an L1 cache of its own, which is why this cluster exchanges data with memory through the integer units.
Since AMD engineers decided to support Intel’s AVX instructions in their Bulldozer processors, they also added support for other current instruction sets, such as SSE4.2 and AES-NI for encryption acceleration. Moreover, AMD introduced a few instructions of their own: four-operand fused multiply-add (FMA4) and their own unique vision of future AVX development, XOP.
Bulldozer’s L2 cache exists as a single unit inside the processor module and is shared by both cores. It is an impressive 2 MB in size and has 16-way associativity. However, the latency of this cache increased to 18-20 clocks, while the bus width remained the same as before: 128 bits. It means that even though Bulldozer’s L2 cache is large, it is not particularly fast: the current competitors and predecessors have L2 caches with about half the latency. Together with the small L1D cache and its 4-clock latency (also higher than in the K10 microarchitecture), this doesn’t look too good. However, AMD insists that the cache latencies were increased only to ensure that Bulldozer can run at higher clock speeds.
Moreover, AMD engineers implemented efficient data prefetch units capable of speculatively loading data into the L1 and L2 caches. These units are claimed to work much more effectively now and should be capable of recognizing irregular access patterns.
In theory, Bulldozer looks very attractive. AMD have completely revised their old vision of processor microarchitecture and came up with a totally new design. And I have to say that this new design looks highly promising, because the new microarchitecture has been optimized for processing four instructions per clock per core instead of three. Besides, it also supports macro-fusion of instructions before the decoding stage, which raises the effective throughput even further.
But everything looks picture-perfect only when we consider one core in isolation and ignore the fact that in reality such cores are combined into pairs. And the dual-core Bulldozer module has too many units shared between the cores. In particular, since such a module has only one instruction prefetch unit and one decoder, the entire dual-core block can still process only four instructions per clock. And it means that in terms of theoretical performance it is a Bulldozer module, not an individual core, that should be considered the logical equivalent of a single core in Sandy Bridge processors. In this case the module’s ability to execute two threads looks like a pretty logical response from AMD to Intel’s Hyper-Threading technology.
Of course, our performance tests of the new processors will dot all the i’s, but even at this point we can’t help thinking that Bulldozer’s positioning as an eight-core processor is more of a marketing move. In reality it is the number of modules that gives a better idea of these processors’ computational potential. In terms of theoretical performance it seems more logical to compare these modules with the cores of second-generation Intel Core processors.
Therefore, a logical question pops up: why did AMD decide to implement two-thread processing within a single processor module? Why couldn’t they simply combine the execution units of the two cores into a single cluster? There are several reasons for that.
First, loading numerous execution units optimally requires very advanced scheduling logic, and AMD haven’t managed to implement branch prediction and instruction and data prefetch efficient enough for that. Therefore, it falls to software developers to deliver Bulldozer-friendly applications that are well-parallelized, support multi-threading, and use the execution units optimally.
Second, a larger number of simultaneously processed computational threads is a good thing in itself. While desktop users, and especially gamers, will hardly benefit greatly from eight fairly simple Bulldozer cores, this microarchitecture should be very welcome in the server environment. So it is quite possible that the primary goal of Bulldozer was regaining AMD’s leadership in the server market rather than making computer enthusiasts happy.