Brief Tour on Haswell Microarchitecture
Although the Haswell microarchitecture represents the “tock” phase of Intel’s cadence, i.e. it introduces considerable design innovations while retaining the same manufacturing technology, it doesn’t follow Intel’s established pattern in one respect: Haswell is meant to optimize Core processors for energy efficiency rather than peak performance. Intel intends to use the new CPU design for mobile devices first and foremost, with traditional PCs and servers being of secondary importance.
The Haswell is versatile, of course. It can be used to make CPUs with conventional performance/power consumption ratios, but those are only byproducts since the focus is on the new low-wattage Core CPUs. That’s why we are not really interested in the majority of innovations implemented in the Haswell microarchitecture. We can only tell you that the new CPU design allows Intel to aggressively market CPUs with a specified power of 6 to 15 watts. Such a low level of power consumption is achieved through various means, from an optimized manufacturing process to the introduction of additional power-saving states that can disable certain CPU subunits even while the system is active.
There are but a few innovations aimed at higher performance. The larger part of the Haswell is inherited from the Ivy Bridge. There are no changes in the basic CPU structure or in the execution pipeline, whose length remains the same as before (14 to 19 stages).
The key improvement in the x86 core design concerns the microinstruction execution stage. There are now a few new execution units and two new execution ports, which add subunits for processing integer instructions, branches and store addresses. The higher parallelism helps solve two problems. First, the CPU now has four integer ALUs, so classic integer code can be executed at the same rate at which it is decoded. Second, the microarchitecture is better suited for floating-point and FMA instructions, which no longer compete with ordinary integer code for execution ports and may make Hyper-Threading more efficient.
The second group of positive changes in the Haswell microarchitecture refers to the cache memory subsystem. The L1 and L2 caches have doubled their bandwidth while retaining the same latency. The L1 cache can execute two 32-byte reads and one 32-byte write per clock cycle. The L2 cache can receive and issue 64 bytes of data per clock cycle.
Furthermore, Haswell-based CPUs support the AVX2 and FMA3 instruction sets, which expand the existing SIMD extensions with 256-bit instructions for processing integer vectors, full-width element permutes, gather operations and floating-point fused multiply-add (a multiplication and an addition performed in a single instruction with a single rounding step). AVX2/FMA3 instructions can be utilized in high-performance computing, gaming, video and audio processing, etc., to speed up popular algorithms.
A few smaller innovations should also be mentioned. The Haswell features improved branch prediction, a larger out-of-order execution buffer, and a larger L2 TLB.
The resulting performance benefits are estimated at 5 to 10%. The clock rates of the new CPUs do not promise much in terms of speed, either. Considering that the manufacturing technology has not changed, Haswell CPUs are likely to have the same frequency potential as the Ivy Bridge series.
We use SiSoftware Sandra 2013 SP3, a synthetic benchmarking suite, to check out various performance-related aspects of different CPU designs. And we are going to compare the Core i7-4770K (Haswell) with the quad-core Core i7-3770K (Ivy Bridge) as both have the same clock rates: 3.5 GHz by default and up to 3.9 GHz in Turbo mode.
The results are not encouraging. Unless the application uses the new AVX2/FMA3 instructions (and today’s software doesn’t yet support them, of course), the new microarchitecture doesn’t offer any real performance benefits. It is a mere 2-3% faster with the simple algorithms Sandra 2013 uses to benchmark performance. Well, even this improvement should be appreciated considering Intel’s current priorities and the lack of competition among top-performance x86 CPUs. And if the new instructions do get implemented everywhere, the Haswell may become much better than its predecessor, ensuring a performance boost of 30-40%. It’s up to application developers now to make use of that advantage.
Still, we can hope that the Haswell will be somewhat faster in real-life applications in any case. This hope is based not on any improvements in its microarchitecture but on the higher bandwidth of its L1 and L2 caches, which can be easily seen in any specialized benchmark such as our Sandra 2013 SP3. The test was performed on platforms equipped with DDR3-1866 SDRAM (9-11-9-27-1T timings).
So indeed, the L1 and L2 caches do work much faster in the Haswell than in its predecessor. This is the key advantage of the new microarchitecture, and it may make it faster than the Ivy Bridge in real-life applications. On the other hand, the L3 cache and memory controller are somewhat slower, which may have a negative effect on performance. To enable individual control over the power-saving states of the uncore part of the CPU, the clock rates of the L3 cache and memory controller are no longer linked to the clock rate of the x86 cores. And even though these subunits work at frequencies similar to those of the x86 cores, their performance is lower, the tradeoff for asynchronous operation.
Summing everything up, we can say that the Haswell microarchitecture can hardly lift the performance of Core CPUs to a new level. The improvements it brings about ensure but a small increase in speed, derived mostly from the increased cache memory bandwidth rather than from any changes in the execution pipeline. Theoretically, the Haswell can show its best with AVX2/FMA3-using code, but software developers don’t seem eager to write such code even though some of those instructions are already supported by AMD processors as well.