Haswell Front End and Execution Cluster
Do you remember what Ivy Bridge looks like? Let me give you a hint:
The inside of the Haswell CPU core is fairly standard for Intel. Haswell has the same basic structure and essentially the same set of functional units as Ivy Bridge. In other words, if we redrew this block diagram for the upcoming Haswell, it would barely change. The only thing we would have to do is add a few mentions of the new instructions: Advanced Vector Extensions 2 (AVX2) and Transactional Synchronization Extensions (TSX).
As for the promised performance boost, it comes from a number of internal modifications and optimizations. None of them is dramatic, but together they provide a boost of about 10-20% in existing applications and a comparable performance increase in some algorithms modified to use Haswell’s new instructions and features.
Most of the smaller changes are concentrated around the core front end. The execution pipeline itself remains the same, and the L1 and L2 cache latencies have not changed either. At the same time, Haswell boasts improved branch prediction, a larger L2 TLB, larger buffers and a larger out-of-order window.
The most exciting innovation, however, is the larger number of execution ports.
Previous-generation microarchitectures, including Ivy Bridge, have only six execution ports. Haswell gains two more. In theory, this means the future processors will be able to execute up to eight micro-ops per clock, which should make code run considerably faster. Of course, the instruction mix has to be right for that, because the execution ports are not universal.
Intel added a fourth port for integer and logical instructions. It is a dedicated port that, unlike the first three, does not get blocked while AVX instructions are executing, for instance. As a result, Haswell can execute up to four integer operations per clock. This is a very important improvement, because the decoder in Intel’s processors can deliver four to five instructions per clock to the execution units. In other words, the new microarchitecture design eliminates a potential bottleneck.
Intel also introduced an additional branch unit, which should noticeably improve performance in branch-heavy code, and a port dedicated exclusively to store-address operations. The latter enables Haswell to do two loads and one store every cycle.
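The effect of the extra ports can be illustrated with a toy model. The sketch below is only an illustration under loud assumptions: the port tables are simplified from Intel's publicly described layouts, every micro-op is treated as independent, and each port accepts one micro-op per cycle. The names `IVY_BRIDGE_PORTS`, `HASWELL_PORTS` and `cycles_to_issue` are hypothetical helpers, not part of any real tool.

```python
from collections import Counter

# Assumed, simplified port maps: which ports can accept each class of micro-op.
# Real cores have many more restrictions (latencies, addressing modes, etc.).
IVY_BRIDGE_PORTS = {
    "int_alu": [0, 1, 5],
    "branch": [5],
    "load": [2, 3],
    "store_addr": [2, 3],
    "store_data": [4],
}
HASWELL_PORTS = {
    "int_alu": [0, 1, 5, 6],   # fourth integer port
    "branch": [0, 6],          # additional branch unit
    "load": [2, 3],
    "store_addr": [2, 3, 7],   # dedicated store-address port
    "store_data": [4],
}


def cycles_to_issue(uops, port_map):
    """Greedy model: count cycles needed to issue independent micro-ops."""
    pending = Counter(uops)
    cycles = 0
    while sum(pending.values()) > 0:
        busy = set()
        # Serve the most constrained classes first (fewest candidate ports).
        for kind in sorted(pending, key=lambda k: len(port_map[k])):
            for port in port_map[kind]:
                if pending[kind] == 0:
                    break
                if port not in busy:
                    busy.add(port)
                    pending[kind] -= 1
        cycles += 1
    return cycles


# 400 independent integer operations.
int_stream = ["int_alu"] * 400
print(cycles_to_issue(int_stream, IVY_BRIDGE_PORTS))  # 134
print(cycles_to_issue(int_stream, HASWELL_PORTS))     # 100

# 100 iterations of "two loads and one store" (store = address + data).
mem_stream = ["load", "load", "store_addr", "store_data"] * 100
print(cycles_to_issue(mem_stream, IVY_BRIDGE_PORTS))  # 150
print(cycles_to_issue(mem_stream, HASWELL_PORTS))     # 100
```

In this model the integer stream is limited to three operations per cycle on the older layout and four on Haswell, and the load/store mix only reaches one iteration per cycle once store addresses have a port of their own.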
Moreover, Haswell offers two ports for 256-bit floating-point operations, including the new FMA instructions. As a result, peak performance for 256-bit instructions on the first two ports alone doubles compared with the previous-generation processors. This modification was necessary because, alongside AVX2, Haswell introduces the fundamentally new FMA (Fused Multiply-Add) instructions, each of which performs two operations at once: a multiplication and an addition. Executing them on the old resources could cause significant delays, which is why Intel provided two execution ports capable of handling these instructions. As a result, Haswell can execute two FMAs every cycle per core.
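To make the "two operations at the same time" idea concrete: a fused multiply-add computes a*b + c and rounds only once, at the end, whereas a separate multiply and add round twice. The sketch below emulates that behavior in pure Python with exact rational arithmetic. It does not execute the hardware instruction, and the function names are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with one final rounding (exact math, then round)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Ordinary arithmetic: the product is rounded before the addition."""
    return a * b + c


# Inputs chosen so that the rounded product loses its lowest bit.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

print(separate_multiply_add(a, b, c))  # 0.0
print(fused_multiply_add(a, b, c))     # 5.551115123125783e-17 (i.e. 2**-54)
```

Besides halving the instruction count for multiply-then-add sequences, the single rounding is why fused results can differ slightly from code compiled without FMA.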
By the way, do not forget that the AVX2 instruction set also supports integer operations on 256-bit vectors. These are performed by separate execution units.
Haswell’s floating-point performance should be very impressive. Thanks to the new FMA instructions, its peak throughput is twice that of Sandy Bridge and Ivy Bridge, as well as of processors based on the Bulldozer microarchitecture, which makes Haswell a great “FP number cruncher”.
It is important to keep in mind that code must be optimized for the new instructions (AVX2 and FMA) in order to enjoy the performance boost described above.
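The doubling of peak floating-point throughput can be checked with simple arithmetic. The sketch below assumes the commonly cited port layouts: one 256-bit multiply port plus one 256-bit add port on Sandy Bridge and Ivy Bridge, and two 256-bit FMA ports on Haswell. The function name and its parameters are hypothetical, and the numbers are theoretical peaks, not measurements.

```python
def peak_flops_per_cycle(vector_bits, element_bits,
                         mul_ports=0, add_ports=0, fma_ports=0):
    """Theoretical peak floating-point operations per cycle per core."""
    lanes = vector_bits // element_bits
    # A multiply or an add is 1 FLOP per lane; an FMA counts as 2.
    return lanes * (mul_ports + add_ports + 2 * fma_ports)


# Single precision (32-bit elements in 256-bit vectors).
sandy_ivy_sp = peak_flops_per_cycle(256, 32, mul_ports=1, add_ports=1)
haswell_sp = peak_flops_per_cycle(256, 32, fma_ports=2)

# Double precision (64-bit elements).
sandy_ivy_dp = peak_flops_per_cycle(256, 64, mul_ports=1, add_ports=1)
haswell_dp = peak_flops_per_cycle(256, 64, fma_ports=2)

print(sandy_ivy_sp, haswell_sp)  # 16 32
print(sandy_ivy_dp, haswell_dp)  # 8 16
```

The peak is only reached by code that actually issues two FMA instructions every cycle, which is why recompilation or hand-tuning is required to see the gain.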
Note that Intel is very passionate about its AVX2 instructions. Most of the improvements in the new Haswell microarchitecture were introduced to ensure that the new AVX2 instructions run very fast. But why? Well, mostly because of video processing algorithms.
However, Intel also believes that AVX2 is a strategically important developmental milestone. While the GPU developers are trying to take over the stage and position their graphics accelerators as the most suitable computational solutions, Intel is not ready to accept it just yet. As we can see, they are continuously adapting their processor design for high-performance computing, and it looks like they might even introduce 512-bit SIMD extensions at some point in the future. Haswell already has a theoretical basis for that: the two ports for 256-bit FP instructions could be combined into one.