4th Generation of Core Microarchitecture: Intel Haswell

In 2013, the 4th generation Intel Core processor family, based on the "Haswell" microarchitecture, will bring faster, thinner, lighter, cooler, and more secure systems with built-in graphics to the mainstream.

by Anna Filatova
09/12/2012 | 12:23 AM

The most interesting and exciting topic of the Intel Developer Forum 2012 that kicked off in San Francisco this Tuesday was, of course, the unveiling of the new Haswell microarchitecture, which Intel is going to introduce in the next generation of their processors.

Intel Corporation's chief product officer described how its low-power processors, starting with the company's 4th generation Intel Core processor family available next year, will set a new standard for mobile computing experiences and innovative Ultrabook, convertible and tablet designs.

Intel is working hard on improving the mobility of the devices with their microarchitecture inside. In his morning keynote presentation David (Dadi) Perlmutter dwelled specifically on this aspect of the technology evolution. About a decade ago the notebook segment got a big push, making ultimate mobility one of the key objectives. Since then, the demand for mobility has grown much bigger, and yet this is just the beginning. People are used to computers as part of their everyday life and want to keep using them while moving around. They want to be able to use their data, watch videos, and run applications without any limitations. So, last year Intel introduced a new device concept - ultrabooks, and a few months ago they rolled out their third-generation Core processors manufactured using a 22 nm process. But it doesn’t stop there.

The next-generation Haswell microarchitecture, also based on the 22 nm process, is coming next year. This microarchitecture was designed with mobility in mind. One of the things they did was cut the idle power by a factor of 20 compared to the 2nd generation Core technology, Sandy Bridge. They also designed the new microarchitecture taking into account advanced power management functionality, frameworks, and the new Windows 8 OS. At the Intel Developer Forum in San Francisco David (Dadi) Perlmutter demonstrated that Intel had indeed reduced the platform idle power of its 4th generation Intel Core processor family based on the next-generation "Haswell" microarchitecture by more than 20 times compared to the 2nd generation, while delivering outstanding performance and responsiveness.

Let’s take a closer look at the upcoming Haswell microarchitecture to get a better idea of what to expect, especially since there were quite a few deep-dive sessions at the forum dedicated to Haswell and the processors based on it.

Haswell Philosophy

Intel continues to stick to their “tick-tock” strategy. The current Ivy Bridge processor generation is a “tick”, which represents the transition of the existing microarchitecture to the new 22 nm production process. Therefore, Haswell will be a “tock”. It means that the production process will remain the same, using 22 nm technology and Tri-Gate transistors, while all engineering effort will focus on architectural modifications and enhancements. So, Haswell is a huge step forward in terms of changes in the processor’s logical structure. By the way, the launch of Haswell will also mark the arrival of a completely new platform.

Haswell represents an entire family of products, and the span of this family is larger than that of prior products. Last year at IDF 2011 Intel talked about Haswell at a very high level. This time they revealed more details on how they were able to improve power and performance.

So, first let’s talk about the Haswell philosophy. Of course, the starting point was the Sandy Bridge and Ivy Bridge generation that came immediately before it. Turbo mode, ring interconnect, hyper-threading, etc. - all these features carry over to Haswell. Of course, Intel made some improvements to all of them, but the basics remained the same. When designing a product like Haswell, the converged core, namely a single microarchitecture that scales from tablet to server, is important. These are the three major pillars in Haswell microarchitecture:

Intel’s design philosophy, which should be carried on into the next product generations, implies that they will use a unified design for multiple diverse applications, from servers to tablets. Intel managed to significantly lower the power consumption of the Haswell core, which allows for such immense design flexibility. Ultra mobile processors and tablet processors will have few cores and low frequencies and will be extremely energy-efficient. In the desktop segment they will be shooting for higher performance achieved through a larger number of cores and a more conventional TDP. And the server segment will accommodate multi-core CPU modifications.

The modularity aspect implies that it is possible to create products that span a very wide range of market segments. It is a great feature that is elevated to a completely new level in the new Haswell microarchitecture. Previously, Ivy Bridge and Sandy Bridge allowed only a few combinations of core counts and two graphics core models. Now there will be more combinations available, if only because there will be at least three graphics core modifications.

To give you an idea of what the new Haswell microarchitecture will be like, Intel singled out several major knobs that drive performance and industry adoption:

Haswell Front End and Execution Cluster

Do you remember what Ivy Bridge looks like? Let me give you a hint:

Inside of the Haswell CPU core are things that are fairly standard for Intel. Haswell has the same exact structure and the same exact set of functional units inside. In other words, if we redo this block diagram for the upcoming Haswell, there will barely be any changes to it. The only thing we will have to do is add a few mentions of new instructions – Advanced Vector Extensions 2 (AVX2) and Transactional Synchronization Extensions (TSX).

As for the promised performance boost, it comes from a few internal modifications and optimizations, which are not too dramatic, but provide a combined boost of about 10-20% in old applications and a comparable increase in performance in some of the algorithms modified to use Haswell’s unique instructions and features.

All the changes are consolidated in the core front-end. The execution pipeline remained the same, and the L1 and L2 cache latencies also haven’t changed. However, Haswell boasts improved branch prediction, a larger L2 TLB, larger buffers, and a larger out-of-order window.

However, the most exciting innovation is the larger number of execution units.

Previous-generation microarchitectures, including Ivy Bridge, have only six execution ports. Haswell acquired two additional ports. It means that the future processors should execute code considerably faster, as they will theoretically be able to execute up to eight micro-ops per clock. Of course, these instructions have to be the right mix, because the execution ports aren’t universal.

They added a fourth port for integer and logical instructions. It is a dedicated port that, unlike the first three, doesn’t get blocked during the execution of AVX instructions, for instance. As a result, Haswell makes it possible to execute up to four integer operations per clock. It is a very important improvement, because Intel’s processor decoder can deliver four to five instructions per clock to the execution units. In other words, they have eliminated a potential bottleneck in the new microarchitecture design.

They also introduced an additional branch unit, which should significantly improve performance in branch-heavy code. They also added a special port exclusively for store address operations. This enables Haswell to do two loads and a store every cycle.

Moreover, they offer two ports for 256-bit floating-point operations. As a result, the peak performance during 256-bit instruction execution via the first two ports alone doubled compared with the previous generation processors. This modification was necessary because Haswell introduces, alongside AVX2, the new FMA (Fused Multiply-Add) instructions, each of which performs two operations at once: a multiplication and an addition. Of course, executing those using old resources could cause significant delays, which is why Intel provided two separate execution ports for these instructions. As a result, Haswell can execute two FMAs every cycle per core.
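To put the "doubled" peak in numbers, here is a back-of-the-envelope sketch. The two FMA ports and the 256-bit width come from the description above; the Sandy Bridge/Ivy Bridge baseline of one 256-bit multiply port plus one 256-bit add port is our assumption for comparison, and the function names are purely illustrative.

```python
# Rough model of peak floating-point operations per core per clock.
VECTOR_BITS = 256  # width of one AVX register


def lanes(element_bits):
    """Number of floating-point elements that fit in one 256-bit vector."""
    return VECTOR_BITS // element_bits


def haswell_flops_per_clock(element_bits, fma_ports=2):
    """Two FMA ports; each FMA counts as 2 operations (multiply + add) per lane."""
    return fma_ports * lanes(element_bits) * 2


def previous_gen_flops_per_clock(element_bits):
    """Assumed baseline: one multiply port and one add port, no FMA,
    so each of the 2 ports contributes 1 operation per lane."""
    return 2 * lanes(element_bits) * 1


for name, bits in (("single precision", 32), ("double precision", 64)):
    new = haswell_flops_per_clock(bits)
    old = previous_gen_flops_per_clock(bits)
    print(f"{name}: {old} -> {new} FLOPs per clock ({new // old}x)")
```

Under these assumptions the model gives 32 single-precision or 16 double-precision operations per clock per core, twice the baseline. This is a theoretical peak that only code built around FMA can approach.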

By the way, do not forget that AVX2 instruction set also supports integer operations with 256-bit vectors. They are performed by separate execution units.

Haswell’s performance during floating-point calculations should be very impressive. Twice the speed of Sandy Bridge and Ivy Bridge, as well as of processors on the Bulldozer microarchitecture, achieved thanks to the new FMA instructions, makes Haswell a great “FP number cruncher”.

It is important to keep in mind that the code must be AVX2-optimized in order to enjoy the above described performance boost.

Note that Intel is very passionate about their AVX2 instructions. Most of the improvements in the new Haswell microarchitecture have been introduced to ensure that the new AVX2 instructions will work very fast. But why? Well, mostly because of the video content processing algorithms.

However, Intel believes that AVX2 is a strategically important developmental milestone. While the GPU developers are trying to take over the stage and position their graphics accelerators as the most suitable computational solutions, Intel is not ready to accept it just yet. As we can see, they are continuously adapting their processor design for high-performance computing and it looks like they might even introduce 512-bit SIMD extensions at some point in the future. Haswell already has a theoretical basis for that: two ports for 256-bit FP instructions could be combined into one.

Cache Subsystem

Doubling the FLOPs is certainly great, but the challenge here is that you need to be able to feed these execution units. Therefore, Intel made a significant effort to improve their cache hierarchy. Internal cache structure and cache size remained the same, but they changed the bandwidths to the caches. It was done mainly to ensure that the cache speed keeps up with the high speed of AVX2 instruction execution in the core. The read and write ports in Haswell’s L1 cache are 32 bytes (256 bits) wide. So, Haswell can do two reads and one write per clock, all 32 bytes wide, at the same time. They also removed restrictions around banking. So, overall, the improvements in L1 cache include the following: doubled bandwidth, eliminated bank conflicts, and significantly improved L1 cache line split latency.
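The port arithmetic is easy to check with a short sketch. The two 32-byte reads and one 32-byte write per clock are taken from the text above; the 16-byte ports used for the previous generation simply follow from the "double the bandwidth" claim and are our assumption, as is the 3 GHz clock used for the absolute figure.

```python
def l1_bytes_per_clock(reads, writes, port_bytes):
    """Peak L1 data cache traffic per clock if every port is busy."""
    return (reads + writes) * port_bytes


# Haswell: two reads and one write per clock, each 32 bytes wide.
haswell = l1_bytes_per_clock(reads=2, writes=1, port_bytes=32)

# Assumed baseline: same port count, half the width (16 bytes).
previous = l1_bytes_per_clock(reads=2, writes=1, port_bytes=16)

CLOCK_HZ = 3.0e9  # hypothetical core clock, for illustration only

print(f"Haswell L1 peak:  {haswell} bytes/clock")
print(f"Previous L1 peak: {previous} bytes/clock")
print(f"Ratio: {haswell / previous:.0f}x")
print(f"At {CLOCK_HZ / 1e9:.0f} GHz: {haswell * CLOCK_HZ / 1e9:.0f} GB/s per core")
```

Two 32-byte reads per clock are exactly what a pair of 256-bit FMA ports needs when each instruction takes one operand from memory, which is presumably why the port width was matched to the vector width.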

They have also made the L2 cache bus wider, so that it can now transfer up to 64 bytes of data per clock cycle, which is twice as much as in Sandy Bridge and Ivy Bridge.

The improvements in cache-memory performance deal only with the bandwidth, while the latency remains the same as before. Moreover, Intel hasn’t yet revealed any details about the size of the L3 cache memory.

As far as we know at this point, the size of Haswell’s L3 cache will depend on the number of cores, and its internal structure will remain the same as before and will include a uni-directional Ring Bus with two stops for each core, which we know very well from CPUs on Sandy Bridge and Ivy Bridge microarchitecture. However, its throughput should increase because data and non-data requests will be processed separately. Besides, they also promised to further optimize the memory controller, which should get deeper buffers and therefore deliver higher write speed.

Transactional Synchronization Extensions

AVX2 is not the only new instruction set introduced in the Haswell microarchitecture. Intel has also developed Transactional Synchronization Extensions (TSX) that add hardware transactional memory support.

Intel wants to make sure that it is easy for developers to write parallel code. TSX provides two software interfaces for designating code regions for transactional execution. Hardware Lock Elision is an instruction prefix-based interface designed to be backward compatible with processors without TSX support. Restricted Transactional Memory is a new instruction set interface that provides greater flexibility for programmers. TSX enables optimistic execution of transactional code regions. The hardware monitors multiple threads for conflicting memory accesses and aborts and rolls back transactions that cannot be successfully completed. Mechanisms are provided for software to detect and handle failed transactions.
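The optimistic "try, detect a conflict, abort, fall back" pattern can be illustrated with a purely software analogy. The sketch below does not use TSX or any Intel API: the class, the function, and the `interfere` hook that simulates a second thread are all hypothetical names made up for this illustration, and the version counter stands in for the conflict tracking that the hardware performs on its own.

```python
import threading


class VersionedCell:
    """A shared value with a version counter and a conventional fallback lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


def run_transaction(cell, update, max_attempts=3, interfere=None):
    """Apply `update` optimistically; return how many attempts were aborted."""
    aborts = 0
    for _ in range(max_attempts):
        start_version = cell.version        # begin: remember what we observed
        new_value = update(cell.value)      # speculative work on a private copy
        if interfere is not None:
            interfere(cell)                 # simulated write by another thread
        with cell.lock:                     # commit (atomic in real hardware)
            if cell.version == start_version:
                cell.value = new_value
                cell.version += 1
                return aborts
        aborts += 1                         # conflict: discard the result, retry
    with cell.lock:                         # fallback: ordinary locked update
        cell.value = update(cell.value)
        cell.version += 1
    return aborts


# Demo: the simulated "other thread" conflicts once, then stays quiet.
cell = VersionedCell(10)
conflicts = iter([True, False])


def other_thread(c):
    if next(conflicts, False):
        with c.lock:
            c.value += 100
            c.version += 1


aborts = run_transaction(cell, lambda v: v + 1, interfere=other_thread)
print(aborts, cell.value)  # prints: 1 111
```

The first attempt is thrown away because the value changed underneath it, and the second one commits. The bounded retry count followed by a plain lock mirrors the software-visible failure handling mentioned above: a transaction is never guaranteed to succeed, so a non-transactional path has to exist.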

Power Consumption Improvements

When talking about the Haswell processor generation, Intel doesn’t hide the fact that they primarily focused on the interests and needs of the mobile systems users. In other words, lowering the power consumption and heat dissipation were practically the major objectives all the way. And they have indeed done everything possible to achieve maximum power and heat reduction: maximum optimization of the semiconductor design and process technology, improvement of the core and uncore, addition of new software controlled power-saving states. 

The optimized production process will obviously allow lowering the semiconductor die power consumption as a whole. However, in Haswell they have fine grained power control. The unutilized parts of the processor will be simply disabled in a very aggressive manner. Moreover, the cores, L3 cache memory and processor integrated graphics will work at different frequencies, which will be adjusted individually depending on the type of performed tasks.

Power-saving states have also been completely refreshed. By disabling “unutilized” units, Intel managed to lower the processor power consumption in idle mode and also improve the transition times from idle to active mode significantly. Namely, they have improved the existing C-states and added new deeper C-states and sped up the transition between them by up to 25%.

A particularly cool feature is the new S0ix state, in which the processor’s idle power consumption is at least 20 times lower than that of the previous generation processors. Moreover, there should be no negative side effects upon transition from this state into the active mode.

According to Intel, S0ix will allow Haswell to find its way into tablets and smartphones. Indeed, the S0ix state combines the advantages of the S0 and S3/S4 states, lowering the power consumption to a minimal level of hundreds of milliwatts without requiring too much time to recover from this state. While in this state, the OS and applications think that the platform is active, but the achieved power levels are the same as were previously associated with the sleep state. It offers a completely different level of idle power while the system stays responsive at all times. In other words, it is the best of both worlds. This is where we get a tremendous improvement in battery life: transition times become shorter and lower power states are reached much quicker. It is continuous, fine-grained (at the smallest levels) and transparent to well-written software. So far Intel has found that most software works just fine with it the way it is.

Haswell also has improved Turbo Boost technology. Intel can now do better load balancing on the power side of the chip to extend the Turbo range. It comes down to much finer-grained control of each unit on the die, so that no power is wasted on performance that isn’t needed.

Moreover, the platform itself has been optimized for maximum energy-efficiency. The link between the CPU and the chipset has been optimized for power. And that is far from all that has been done: Intel encourages its partners to address the energy-efficiency aspect carefully, too. Namely, they have been working closely with the vendors and defined a new power allocation for idle states. It allows the manufacturers to meet power goals on the platform level. Intel also worked very hard on optimizing controllers and voltage regulators (efficiency improvements), and on bringing new innovations to the platform in terms of power architecture. They have added a number of low-power IO standards, which give OEMs and ODMs flexibility in picking the devices with the best suited power characteristics.

So, a lot of really exciting innovations made it into the Haswell platform in terms of power efficiency and power management. Intel is developing a 10W TDP processor for even greater battery life and reduced heat, which leads to thinner as well as lighter Ultrabooks. Because of the lower power budget, its performance won't be as strong as that of the 17W Ivy Bridge CPU.

Windows 8 has fully-fledged support for all the power-saving innovations, so their actual adoption should go quickly and painlessly.

Graphics Core

Graphics core is a completely different story. It has also undergone some significant modifications, as Intel has been very serious about equipping their processors with some powerful graphics. Haswell is expected to double the graphics performance vs. Ivy Bridge processors bringing its performance on par with the $50 - $70 graphics cards.

The general architecture of Haswell graphics is the same as that of the graphics core in Ivy Bridge processors. Just like with their computational core, Intel works on improving individual units and increasing the number of execution units, which became possible due to modular core structure.

At the same time, the next generation Broadwell processors will most likely feature a completely new graphics core.

This is the 4th generation of integrated graphics. This round they added further API extensions on top of those in Ivy Bridge and enhanced performance, while keeping the design very much software- and driver-compatible. Of course, they have done some work on graphics core power efficiency as well.

However, they have also introduced a new approach to creating GPU cores with different performance levels. Previously, they designed one powerful core and then rolled out its “lite” version with a cut-down number of execution units. Now there are three graphics core modifications. The top one doesn’t just have an increased number of EUs, but has duplicated general and shader domains.

Moreover, to ensure that the power consumption is managed in the most optimal way, the GPU has been totally decoupled. Now you can have the CPU and the graphics core working at low voltage and low frequency, which in the end produces a very flexible engine. The GT3 SKU may go either into a much higher performance segment or into a higher power efficiency segment.

The four slides above give a perfect idea of what the most feature-rich GT3 Haswell GPU microarchitecture will be like. As we can see, the pipeline now has a resource streamer. By offloading some work from the driver, more power and frequency can now be allocated to the graphics. This made it possible to reduce the driver overhead substantially.

Other than that, the GT3 graphics modification simply has doubled everything. The performance of the classical rendering pipeline has been doubled. Rasterization, Z-buffer, and Stencil buffer have been doubled, too. There are now twice as many execution units. As a result, the Intel Haswell GT3 graphics core is twice as fast as the top Ivy Bridge GPU. Or we can look at it differently: Haswell is capable of providing the same graphics performance as Ivy Bridge, but at only half the power.

Of course, there are some connectivity improvements, too. The new graphics core will support three independent monitors connected via digital HDMI and DisplayPort interfaces. In this case the maximum supported resolution will be 4K x 2K.

Media Engine

The Haswell media engine is yet another way of lowering the processor power consumption. Instead of encoding and decoding video in the computational cores, they now utilize a special engine that needs much less power. And keeping in mind that video processing is currently one of the most popular types of user activity, this approach is totally reasonable. This is exactly why Intel pays such close attention to media in general and Quick Sync in particular.

Haswell supports an extended range of formats for video encoding and decoding besides the previously supported formats. Among the new ones are native MVC short format, MJPEG decode and hardware decode acceleration of SVC (Scalable Video Coding). Haswell also acquired native support for large resolution content, up to 4Kx2K (for example, 4096x2304, 4096x2160 and 3840x2160). I have to say that this is a very timely addition. The ecosystem for resolutions like that is already shaping up: Sony has recently announced a new TV set supporting these resolutions, and YouTube also supports resolutions like that.
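To get a feel for how much more data these modes carry per frame, here is a small sketch that counts pixels. The three 4Kx2K modes are the ones listed above; the comparison against 1920x1080 Full HD is our own choice of baseline.

```python
FULL_HD = (1920, 1080)  # baseline picked for comparison, not from Intel's slides
MODES_4K = [(4096, 2304), (4096, 2160), (3840, 2160)]


def pixels(mode):
    """Pixels per frame for a (width, height) mode."""
    width, height = mode
    return width * height


for mode in MODES_4K:
    ratio = pixels(mode) / pixels(FULL_HD)
    print(f"{mode[0]}x{mode[1]}: {pixels(mode):,} pixels, {ratio:.2f}x Full HD")
```

Every one of these modes pushes at least four times as many pixels per frame as Full HD, which is why dedicated decode hardware matters at such resolutions.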

In the codec space Intel has been continuously improving the codec engine. They introduced hardware-based SVC (Scalable Video Coding), which is a derivative of AVC H.264. They have also implemented support for MPEG2 encode. And this time they claim that the encoder has become truly low-latency and can be used for video conferencing.

The encoding speed is also expected to improve. Quick Sync in Ivy Bridge is faster than in Sandy Bridge, and it should be even faster than that in Haswell. Of course, its performance will depend on the graphics core model, but according to Intel, the GT3 core should transcode a 2-hour movie in a few minutes.

They have also paid special attention to the video quality aspect. There are new functions in the engine that may be applied during video encoding process. Haswell is claimed to produce better video quality than Ivy Bridge even at identical bitrate levels.

They have added new effects to the list of functions implemented in the media engine, which can be applied to video processing.

Of course, the media engine also has some power-saving features that make Haswell processors more energy efficient than their predecessors during video encoding/decoding.

Conclusion

At last year’s IDF Intel revealed Ivy Bridge. Frankly speaking, back then they were more detailed and down-to-earth. This time we heard the story of Haswell, but it was mostly recited off the presentation slides. Yes, we did see a working concept system at the Tuesday keynote, but there was only one and it was obviously far from final. And there was no mention of the silicon at all. We didn’t see the actual chip, and only heard some general things about the microarchitecture, without any details on the specific SKUs. What does it mean? Has Intel become more secretive? Not at all. It is simply the absence of real competition in the high-end computing segment that took away the need to keep up the fast pace. Haswell will not hit the market at the beginning of the year, which is Intel’s traditional “big launch timeframe”. And when we asked “when?”, we heard a very vague response about “some time in mid 2013”. So, it means that they are still quite far from the final stages of Haswell launch-readiness.

However, what we learned yesterday made us very optimistic about the future. The progress continues, and the microarchitecture gets further improved and perfected. However, most of Intel’s efforts are not aimed at increasing the processors’ computational performance. The primary focus is on lowering the power consumption and speeding up the graphics core. In other words, the focus on the mobile segment is more than obvious. This isn’t such good news for enthusiast desktop users. It means that we shouldn’t expect a substantial performance boost with the launch of the new Haswell processors. 10-20% in typical applications compared with Ivy Bridge based systems working at the same frequency is most likely as good as it gets. Clock frequencies also shouldn’t increase dramatically: the production process is the same, and there will be no changes to the execution pipeline. So, the desktop Haswell will most likely look more like a regular evolutionary refresh of the old design. And it won’t be the most convenient refresh, because it will bring the new LGA 1150 platform with it.

Nevertheless, with nothing super-innovative happening in the desktop processor market, we are still very excited to see the new Haswell. It does have a lot of appealing features. We would really like to see the new enhanced overclocking and how it will handle the integrated voltage regulator. We also hope that software developers will not avoid the new AVX2 instructions and the new media functionality of the Haswell processors, because these features may make the new processors a very popular and broadly adopted solution.