The new microarchitecture of the Kaveri’s x86 cores is perhaps the most intriguing innovation this APU brings. After AMD’s previous high-performance designs, Bulldozer and Piledriver, failed to catch up with Intel’s Core series in performance, the Steamroller came to be regarded as AMD’s next attempt to deliver a truly competitive top-end product, one expected to fix the key weakness of the earlier designs: their low single-threaded performance.
Well, even if the Steamroller is indeed a big step forward compared to its predecessors, it is no breakthrough. AMD will not implement it in fast multicore processors, limiting it to the quad-core Kaveri, which is positioned as an affordable integrated solution. AMD claims that the new microarchitecture can improve performance by about 20% over the Piledriver at the same clock rate. The Steamroller’s more complex design and mobile focus mean that its top clock rate is lower, though, and the practical performance benefits are rather small, too. Even the move to the more advanced 28nm tech process doesn’t save the day for AMD.
Thus, the Steamroller should be viewed as an improvement on the previous Bulldozer and Piledriver microarchitectures, judging by performance as well as internal design. AMD keeps on optimizing its basic microarchitecture in small steps, building on the Bulldozer foundation. As in the earlier designs, the Steamroller features dual-core x86 modules with a shared 2MB L2 cache per module. There are no innovations in terms of instruction sets: the Steamroller doesn’t support AVX2.
It is the distribution of resources shared by the cores within a single module that has been revised. In the original Bulldozer concept, quite a lot of functional units were shared within each dual-core module, including the instruction fetch and decode units, the floating-point unit, and cache memory. This made the semiconductor die less complex, reducing its heat dissipation and enabling multicore processors with rather high frequency potential. The downside of that concept was that the shared resources would become a bottleneck under multithreaded loads. Practice showed that the instruction decoding stage was the most performance-limiting bottleneck, so AMD doubled the number of decode units in the Steamroller microarchitecture.
Now each core in the dual-core module has a dedicated decode unit capable of processing up to four x86 instructions per clock cycle. Instructions are still fetched by a shared fetch unit, whose efficiency and performance have been improved in other ways. In particular, branch prediction has become more accurate thanks to larger buffers. The shared L1 instruction cache has grown from 64 to 96 kilobytes and is now 3- rather than 2-way associative.
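As a quick sanity check on those cache figures: growing the L1 instruction cache from 64KB 2-way to 96KB 3-way keeps the number of sets constant (assuming the usual 64-byte cache line for this core family), so the extra capacity comes entirely from holding one more line per set rather than from a redesigned indexing scheme. A minimal sketch of the arithmetic:

```python
# Arithmetic on the L1 instruction cache geometry.
# Assumes a 64-byte cache line, the standard for this core family.
LINE = 64

def sets(size_bytes, ways, line=LINE):
    """Number of sets = total size / (associativity * line size)."""
    return size_bytes // (ways * line)

old = sets(64 * 1024, 2)   # Bulldozer/Piledriver: 64 KB, 2-way
new = sets(96 * 1024, 3)   # Steamroller: 96 KB, 3-way
print(old, new)  # both come out to 512 sets
```

In other words, the wider associativity reduces conflict misses in each set while leaving the index logic untouched, which fits AMD’s claim of a higher L1 hit rate.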
It must be noted that the doubled number of decode units and the other optimizations are only meant to get rid of the main bottleneck in the microarchitecture. We can’t expect the Steamroller to deliver double performance. There are still a few bottlenecks at the instruction fetch and execution stages, which are only going to be addressed in the next microarchitecture revision, codenamed Excavator.
Besides the changes at the front end of the execution pipeline, the Steamroller only features minor improvements that don’t affect its performance much. For example, the FPU’s execution units have been rebalanced to even out their utilization, and the interface between the L1 and L2 caches now ensures faster data transfers. Some of the Steamroller improvements are only meant to achieve better energy efficiency: the L2 cache is split into four independent sections, each of which can be turned off to save power, and the decode units now have a micro-op queue which, when full, allows those units to be shut down as well.
The higher performance of the Steamroller microarchitecture goes hand in hand with higher complexity. The transistor count per dual-core module has increased by over 60% in the transition from Piledriver to Steamroller, due to microarchitecture improvements and new automated chip design methods. Thus, the Steamroller seems to diverge from the original concept of building processors out of many high-frequency, low-complexity cores. In practical terms, this is reflected in AMD’s unwillingness to apply the Steamroller to multicore FX series products.
Anyway, AMD promotes the Steamroller optimistically, emphasizing its advantages and glossing over its tradeoffs. According to the official data, the hit rate of the L1 instruction cache is improved by 30%, branch prediction accuracy by 20%, and overall scheduler efficiency by 5 to 10%. All of this is said to improve the utilization of the execution units by about 25%.
Of course, such claims must be checked out in practice. So we are going to compare the actual performance of quad-core Richland and Kaveri-based processors (based on the Piledriver and Steamroller microarchitecture, respectively) clocked at the same frequency of 4.0 GHz in the synthetic benchmarks of the AIDA64 4.30.2907 utility. We also throw in a quad-core Haswell clocked at 4.0 GHz with Hyper-Threading turned off. The Richland’s results serve as the baseline for better readability.
The picture is rather gloomy for AMD. For all their efforts, AMD’s developers have not been able to ensure a substantial increase in performance. The Steamroller is a mere 10% faster than the Piledriver on average. There are even scenarios where the new microarchitecture is slower, such as the Queen benchmark, which focuses on branch prediction and the penalties associated with mispredictions. This raises some doubts about AMD’s claims that the front end of the execution pipeline has been optimized.
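For context on why the Queen test stresses the branch predictor: it solves the classic N-queens puzzle, a backtracking search dominated by data-dependent, hard-to-predict branches. The following is an illustration of that kind of workload, not AIDA64’s actual code:

```python
# Minimal N-queens solver using bitmasks. Counting placements is
# dominated by data-dependent branches (the conflict checks), which
# is what makes this workload sensitive to branch prediction quality.
def queens(n, row=0, cols=0, diag1=0, diag2=0):
    if row == n:
        return 1
    count = 0
    for col in range(n):
        c, d1, d2 = 1 << col, 1 << (row + col), 1 << (row - col + n)
        # Hard-to-predict branch: depends on the search state so far.
        if not (cols & c or diag1 & d1 or diag2 & d2):
            count += queens(n, row + 1, cols | c, diag1 | d1, diag2 | d2)
    return count

print(queens(8))  # 92 solutions on a standard 8x8 board
```

Every mispredicted conflict check costs a pipeline flush, so a core with a weaker predictor loses a fixed penalty per miss regardless of how wide its decoders are.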
It is the hashing benchmark that shows the highest performance benefits from the Steamroller microarchitecture. The benchmark uses the standard SHA1 algorithm with integer vector instructions.
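To give a sense of that workload (purely illustrative: AIDA64’s inner loop is hand-tuned with integer SIMD instructions, while Python’s hashlib delegates to an optimized C implementation), hashing a buffer with SHA-1 looks like this:

```python
import hashlib

# Hash 1 MiB of data with SHA-1, the algorithm the AIDA64 Hash
# benchmark is built around. This only illustrates the workload,
# not AIDA64's vectorized implementation.
data = b"\x00" * (1 << 20)
digest = hashlib.sha1(data).hexdigest()
print(digest)  # 40 hex characters (160-bit digest)
```

SHA-1’s round function is a long chain of 32-bit integer rotates, adds, and logical operations, which is why an implementation built on integer vector instructions maps well onto the Steamroller’s strengthened integer cores.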
The diagram shows the gap between AMD and Intel solutions, too. There’s a twofold difference between the Kaveri- and Haswell-based processors working at the same clock rate and having the same number of x86 cores. AMD’s new microarchitecture doesn’t even try to compete in terms of sheer speed, so the quad-core Kaveri can only be viewed as a competitor to the dual-core i3 models when it comes to computing performance.
Now let’s check out the floating-point performance.
The Kaveri is 6-7% ahead of the Richland in this test on average, both processors working at the same clock rate. The AMD processors are very slow in comparison with the Haswell, which is to be expected as quad-core Richland and Kaveri-based products only incorporate two FPUs.
Thus, the Kaveri series is just as weak in terms of x86 performance as its predecessors. Whatever claims AMD may make about its innovations, it cannot compete with Intel’s quad-core solutions. We’ll talk about the practical performance of Kaveri APUs in popular applications shortly. Right now, let’s discuss the integrated graphics core, an area where AMD is much stronger than in x86 core design.