Where Is High Performance Coming From?
Intel engineers made a number of microarchitectural changes, some of them quite unexpected, that raised processor performance while lowering power consumption and heat dissipation. Sandy Bridge is not merely the next evolutionary step after the Nehalem microarchitecture: it also borrows a number of ideas from the seemingly failed Pentium 4 project. Yes, even though Intel abandoned the NetBurst microarchitecture long ago because of its poor energy efficiency, some functional units of the Pentium 4 processors now find their way into the new Core i3, Core i5 and Core i7 CPUs. It is especially ironic that in Sandy Bridge these borrowings serve not only to raise performance, but also to lower power consumption.
We notice significant changes in the Sandy Bridge microarchitecture at the very beginning of the pipeline, where x86 instructions are decoded into simpler processor micro-ops. The decoder itself remained the same as in Nehalem: it processes 4 instructions per clock cycle and supports the Micro-Fusion and Macro-Fusion technologies, which make the output stream of micro-ops more uniform in terms of execution complexity. However, the decoded micro-operations are no longer just passed on to the next processing stage; they are also cached. In other words, in addition to the regular 32 KB L1 instruction cache found in almost any x86 processor, Sandy Bridge has an extra “L0” cache that stores the decoding results. This cache is the first flashback to the NetBurst microarchitecture: its general operating principles make it similar to the Execution Trace Cache.
The decoded micro-op cache is about 6 KB in size and can store up to 1,500 micro-ops, which makes it a great help to the decoder. When the incoming instruction stream contains instructions that have been translated earlier and are still in the cache, the cached micro-operations are used and no new decoding is performed. This takes a big load off the decoder, which is a rather power-hungry part of the CPU. According to Intel, the additional cache comes in handy in about 80% of cases, so doubts about its usefulness are unfounded. Besides, when the decoder in Sandy Bridge is idle, it is switched off, which lowers the CPU power consumption substantially.
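The effect of such a cache on a hot loop can be illustrated with a toy model. This is only a sketch under loud assumptions: the real structure is not a simple address-keyed LRU map, and the `UopCache` class, the `toy_decode` function and the "two micro-ops per instruction" rule are all hypothetical. Only the 1,500 micro-op capacity comes from the text.

```python
from collections import OrderedDict


class UopCache:
    """Toy decoded micro-op cache: maps an instruction address to its micro-ops."""

    def __init__(self, capacity_uops=1500):
        self.capacity_uops = capacity_uops
        self.entries = OrderedDict()  # address -> tuple of micro-ops, in LRU order
        self.stored_uops = 0
        self.hits = 0
        self.decodes = 0

    def fetch(self, address, decode):
        """Return micro-ops for `address`, calling the decoder only on a miss."""
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)  # mark as most recently used
            return self.entries[address]
        self.decodes += 1
        uops = tuple(decode(address))
        # Evict least recently used entries until the new micro-ops fit.
        while self.entries and self.stored_uops + len(uops) > self.capacity_uops:
            _, evicted = self.entries.popitem(last=False)
            self.stored_uops -= len(evicted)
        if len(uops) <= self.capacity_uops:
            self.entries[address] = uops
            self.stored_uops += len(uops)
        return uops


def toy_decode(address):
    # Hypothetical decoder: pretend every instruction becomes two micro-ops.
    return (f"uop{address}a", f"uop{address}b")


cache = UopCache()
loop_body = range(100)   # 100 instruction addresses in a hot loop
for _ in range(10):      # the loop body runs ten times
    for addr in loop_body:
        cache.fetch(addr, toy_decode)

hit_rate = cache.hits / (cache.hits + cache.decodes)
print(f"decodes={cache.decodes} hits={cache.hits} hit rate={hit_rate:.0%}")
```

In this model the decoder is invoked only during the first pass over the loop; every later iteration is served from the cache, which is the situation in which the real decoder could be powered down.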
The second important improvement in the early pipeline stages concerns the branch prediction unit. It is hard to overestimate the importance of this unit working properly, because each incorrect branch prediction requires stopping the pipeline and flushing it completely. As a result, mispredictions not only hurt performance, but also waste additional power on filling the pipeline all over again. I have to say that Intel managed to make this unit extremely efficient in their new processors. To do so, they reworked all the Sandy Bridge buffers that store branch addresses and prediction history in order to increase their data density. As a result, Intel can store a longer branching history without increasing the size of the data structures used by the branch prediction unit. This had a great effect on the efficiency of the unit, which depends directly on the amount of statistical data about executed branches that it works with. According to preliminary estimates, branch prediction accuracy in Sandy Bridge improved by more than 5% compared with its predecessor.
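Why a longer history helps can be shown with a textbook predictor. The sketch below is not Intel's design, which the article does not describe in detail. It assumes a table of 2-bit saturating counters indexed by the last few branch outcomes, and the function name `prediction_accuracy` and the test pattern are made up for illustration.

```python
def prediction_accuracy(outcomes, history_bits):
    """Accuracy of 2-bit saturating counters indexed by the last `history_bits` outcomes."""
    counters = {}                      # history tuple -> counter in 0..3
    history = (0,) * history_bits
    correct = 0
    for taken in outcomes:
        counter = counters.get(history, 2)   # start at "weakly taken"
        predicted_taken = counter >= 2
        if predicted_taken == bool(taken):
            correct += 1
        # Train the counter towards the real outcome, saturating at 0 and 3.
        counters[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        if history_bits:
            history = history[1:] + (taken,)
    return correct / len(outcomes)


# A branch that is taken three times and then not taken, over and over.
pattern = [1, 1, 1, 0] * 1000
short_history = prediction_accuracy(pattern, 1)
long_history = prediction_accuracy(pattern, 3)
print(f"1 bit of history: {short_history:.2%}, 3 bits of history: {long_history:.2%}")
```

With one bit of history the predictor cannot tell the third "taken" from the first and keeps mispredicting the final iteration. With three bits every position in the pattern has its own counter, and the mispredictions disappear after warm-up.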
But it is the key part of any out-of-order processor, the out-of-order cluster, that underwent the most interesting modifications. This is where the Sandy Bridge and NetBurst microarchitectures come closest: Intel engineers brought the physical register file back into their new processors (if you remember, they had dropped it in the Core and Nehalem processors in favor of a centralized Retirement Register File). Previously, when micro-ops were reordered, a full copy of the register operands was stored in the buffer for each operation. Now the buffer holds references to register values stored in the physical register file. This approach not only eliminates excessive data transfers, but also avoids duplicating the register contents many times over, thus saving space in the register file.
As a result, the out-of-order cluster in Sandy Bridge processors can keep up to 168 micro-ops “in sight” at the same time, while Nehalem processors could hold only 128 micro-ops in their ROB (reorder buffer). Some energy is saved as well. However, replacing actual values with references to them has a downside too: the execution pipeline gains new stages needed to dereference the pointers.
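The idea of in-flight micro-ops carrying references instead of values can be sketched as follows. This is a deliberately simplified model, not the actual Sandy Bridge structures: the class names, the 16-entry file and the two supported operations are assumptions, and physical registers are never released here, unlike in real hardware.

```python
class PhysicalRegisterFile:
    """Values live in one place; everything else holds an index into `values`."""

    def __init__(self, size):
        self.values = [0] * size
        self.free = list(range(size))

    def allocate(self):
        return self.free.pop(0)


class Renamer:
    def __init__(self, prf, arch_regs):
        self.prf = prf
        # Rename table: architectural register name -> physical register index.
        self.table = {name: prf.allocate() for name in arch_regs}
        self.rob = []  # in-flight micro-ops, holding indices only

    def issue(self, op, dest, src1, src2):
        # Sources are recorded as pointers; the destination gets a fresh register.
        p1, p2 = self.table[src1], self.table[src2]
        pd = self.prf.allocate()
        self.table[dest] = pd
        self.rob.append((op, pd, p1, p2))

    def execute_all(self):
        for op, pd, p1, p2 in self.rob:
            a, b = self.prf.values[p1], self.prf.values[p2]  # dereference step
            self.prf.values[pd] = a + b if op == "add" else a * b
        self.rob.clear()

    def read(self, name):
        return self.prf.values[self.table[name]]


prf = PhysicalRegisterFile(16)
renamer = Renamer(prf, ["r0", "r1", "r2"])
prf.values[renamer.table["r0"]] = 3
prf.values[renamer.table["r1"]] = 4
renamer.issue("add", "r2", "r0", "r1")   # r2 = r0 + r1
renamer.issue("mul", "r0", "r2", "r2")   # r0 = r2 * r2
rob_snapshot = list(renamer.rob)         # contains small indices, not operand values
renamer.execute_all()
print(rob_snapshot, renamer.read("r2"), renamer.read("r0"))
```

Each buffer entry stays the same small size no matter how wide the registers are. The price is the extra lookup marked as the dereference step, which corresponds to the additional pipeline stages mentioned above.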
However, the developers did not really have much of a choice with Sandy Bridge. These processors support the new AVX instructions that operate on 256-bit registers, so moving their values back and forth many times would inevitably create additional overhead. Intel engineers also made sure that the new instructions execute fast in the Sandy Bridge microarchitecture. High performance is what will persuade software developers to adopt the new instructions, because only then can they really increase parallelism and throughput in vector calculations.
AVX instructions are a further development of SSE that widens the SIMD vector registers to 256 bits. Moreover, the new instruction set allows non-destructive execution, in which the result is written to a separate register and the original data in the source registers is not lost. As a result, the AVX instruction set, just like the microarchitectural improvements, can be regarded as an innovation that both raises performance and saves power, because it allows many algorithms to be simplified and tasks to be completed with fewer instructions. AVX instructions are well suited to heavy floating-point calculations in multimedia, scientific and financial applications.
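How non-destructive operands reduce the instruction count can be shown on a toy register machine. The sketch below does not use real SSE or AVX encodings: the `run` interpreter, the `mov` and `mul` mnemonics and the register names are hypothetical, and the four-lane tuples merely stand in for vector registers.

```python
def run(program, registers):
    """Run a toy vector program; instructions are (op, dest, source...)."""
    regs = dict(registers)
    for op, dest, *sources in program:
        if op == "mov":
            regs[dest] = regs[sources[0]]
        elif op == "mul":
            # With a single source the destination doubles as the first input
            # and is overwritten (two-operand, destructive form).
            left = regs[dest] if len(sources) == 1 else regs[sources[0]]
            right = regs[sources[-1]]
            regs[dest] = tuple(x * y for x, y in zip(left, right))
        else:
            raise ValueError(f"unknown op: {op}")
    return regs


inputs = {
    "a": (1.0, 2.0, 3.0, 4.0),
    "b": (2.0, 2.0, 2.0, 2.0),
    "c": (0.5, 0.5, 0.5, 0.5),
}

# Goal: t = a*b and u = a*c, with a still intact afterwards.
two_operand = [
    ("mov", "t", "a"), ("mul", "t", "b"),   # copy a first, then overwrite the copy
    ("mov", "u", "a"), ("mul", "u", "c"),
]
three_operand = [
    ("mul", "t", "a", "b"),                 # result goes to a separate register
    ("mul", "u", "a", "c"),
]

destructive = run(two_operand, inputs)
non_destructive = run(three_operand, inputs)
print(len(two_operand), len(three_operand), destructive == non_destructive)
```

Both programs produce the same registers, but the three-operand version needs half as many instructions because no copies are required to protect `a`.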
The processor execution units have been redesigned specifically so that 256-bit instructions execute efficiently. The major part of the redesign pairs two 128-bit execution units to process 256-bit packed data. Since each of the three execution ports in Sandy Bridge processors (just like in Nehalem) has units for three types of data, namely 64-bit, 128-bit integer and 128-bit floating-point, it makes perfect sense to join the SIMD units into pairs within the same port. Most importantly, this rearrangement of resources does not reduce the throughput of the processor execution units at all.
Since Sandy Bridge is designed to work with 256-bit vector instructions, the developers also had to improve the performance of the functional units responsible for loading and storing data. The three ports that served this purpose in Nehalem have migrated to Sandy Bridge. However, to increase their efficiency, Intel engineers unified two of these ports, one of which used to handle store addresses and the other data loads. Now the two are symmetric: each of them can handle either a load (address and data) or a store address. The third port remained unchanged and is used for storing data. Since each port can pass up to 16 bytes per clock, the total throughput of the L1 data cache in the new microarchitecture increased by 50%. As a result, CPUs with Sandy Bridge microarchitecture can load up to 32 bytes of data and store 16 bytes of data per clock cycle.
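The 50% figure follows from simple arithmetic, spelled out below. The Sandy Bridge numbers come from the text. The Nehalem baseline of one 16-byte load plus one 16-byte store per clock is an assumption that is consistent with the stated increase, and the dictionary names are illustrative.

```python
BYTES_PER_PORT = 16  # per clock, as stated in the text

# Assumed baseline: one port loads data, one stores data
# (the third port only supplies store addresses).
nehalem = {"load": 1 * BYTES_PER_PORT, "store": 1 * BYTES_PER_PORT}

# Sandy Bridge: both unified ports can load data in the same clock.
sandy_bridge = {"load": 2 * BYTES_PER_PORT, "store": 1 * BYTES_PER_PORT}

nehalem_total = sum(nehalem.values())
sandy_bridge_total = sum(sandy_bridge.values())
increase = sandy_bridge_total / nehalem_total - 1

print(f"Nehalem: {nehalem_total} B/clk, Sandy Bridge: {sandy_bridge_total} B/clk, "
      f"increase: {increase:.0%}")
```

The peak moves from 32 to 48 bytes per clock, with all of the gain on the load side.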
To sum up all the innovations described above, the microarchitecture of the computational cores in Sandy Bridge processors has been modified very substantially. These innovations are undoubtedly serious enough to be regarded as a major redesign rather than a simple fix for Nehalem’s bottlenecks.