Piledriver Microarchitecture: Is It Really Better Than Bulldozer?
In our previous Trinity review we discussed the architecture of the Devastator graphics core and concluded that the move to the VLIW4 architecture was a very positive one. Now it is time to talk about the computing cores. Compared with Llano, they have also undergone significant changes: instead of the x86 Husky cores with Stars microarchitecture, Trinity uses modules with Piledriver microarchitecture, which is yet another iteration in the Bulldozer evolution. As you know, with the introduction of Bulldozer AMD changed their priorities dramatically. Compared with Stars, this microarchitecture executes fewer instructions per clock but can reach higher clock speeds. However, not everyone was happy with this outcome, so three quarters after the first Bulldozer processors hit the streets, AMD prepared a microarchitectural refresh – the “corrected” Piledriver.
Trinity processors use Piledriver cores, marking this microarchitecture’s public debut. AMD believes the improvements should be enough to make Trinity noticeably faster than Llano. Does that mean the new computational cores will allow AMD to compete successfully against Intel’s products? This question is particularly acute because in about three to four weeks AMD will release new FX processors with the same Piledriver cores inside. And while with Trinity processors it is still possible to claim that their performance is “quite sufficient” by hiding the actual x86 speed behind the strong graphics performance, the same trick will not work with the new FX processors. Therefore, the first thing we would like to investigate is how far the new Piledriver has advanced beyond the “classical” Bulldozer microarchitecture.
However, do not pin too many hopes on the new Piledriver. In structural terms, this microarchitecture is exactly the same as Bulldozer, i.e. it consists of dual-core modules with two sets of integer execution units, while some of the resources are shared between the two cores. Among these shared resources are the cache memory, the instruction fetch unit, the instruction decoder and the floating-point unit. As a result, a module can process two threads simultaneously, but its peak performance is capped by the throughput of the shared decoder, which can decode no more than four instructions per clock for both cores combined. For reference: Intel Core processors have a decoder with comparable throughput, but each core in the processor gets its own. This means the number of instructions Piledriver can process per clock could not increase dramatically. The real changes will only come in the next generation of the microarchitecture, aka Steamroller: supposedly, AMD will provide a dedicated instruction decoder for each core of their dual-core modules. For now, all the improvements in Piledriver come down to optimizations of the internal algorithms of individual units and do not affect the design as a whole.
According to AMD, the major improvements in the Piledriver design are the following:
- Improved branch prediction accuracy thanks to a hybrid predictor augmented with a second-level predictor;
- Support for the 128- and 256-bit FMA3 (fused multiply-add) and F16C (half-precision floating-point conversion) instruction set extensions;
- Optimized schedulers;
- Faster division thanks to a reworked divider unit;
- Larger L1 TLB;
- Improved L1 and L2 pre-fetchers that can work with variable-length patterns, including patterns that cross page boundaries;
- Improved L2 cache efficiency through more aggressive eviction of unused data that the pre-fetcher algorithms loaded into the cache speculatively.
None of the improvements listed above raises decoder throughput, but together they do accelerate things a little. To estimate how efficient Piledriver is compared with its predecessor, we carried out a short practical test session, comparing the new quad-core A10-5800K processor with Piledriver microarchitecture against the quad-core FX-4170 processor with Bulldozer microarchitecture. To make the comparison cleaner, both processors ran at a fixed 4.0 GHz with Turbo Core technology disabled for the duration of the tests. Note that unlike the A10-5800K with its two-level cache hierarchy, the FX-4170 has an 8 MB L3 cache, which cannot be disabled, so keep in mind that the Bulldozer based processor had a slight advantage. Both systems were equipped with DDR3-1866 SDRAM with 9-11-9-27-1T timings and an Nvidia GeForce GTX 680 graphics card.
First let’s check out the memory sub-system performance in the Cache & Memory Benchmark from the Aida64 suite.
As we can see, the A10-5800K doesn’t do that well here: Bulldozer delivers higher practical bandwidth and lower latencies. However, this is not a shortcoming of the Piledriver microarchitecture itself. In reality, we are comparing processors on two different platforms. Trinity has been optimized so that memory can be shared efficiently between the computing and graphics cores. The more complex algorithms of its DDR3 SDRAM controller, which require additional request priority arbitration, introduce delays that make Trinity fall behind Bulldozer in these tests. Unfortunately, even when a Socket FM2 system is equipped with a discrete graphics card and the graphics core integrated into the APU sits idle, Trinity’s x86 cores still do not work with memory fast enough.
Now let’s take a look at the computing performance:
As we can see from the results, the Piledriver microarchitecture is only a little faster than Bulldozer from a practical perspective. The largest performance advantage is just 7%, and on average the new design is only 1.5% faster across the benchmarks above. It is important to keep in mind, however, that the Piledriver model we tested lacks an L3 cache and has a slower memory controller, which is exactly why its performance drops in some benchmarks that work intensively with large data volumes. Still, we do not think the processors with the new microarchitecture for the Socket AM3+ platform will change this situation dramatically. Since the number of instructions they can process per clock cannot really increase, a 5-10% performance boost is probably as good as it will get when the new Vishera processors come out.