by Ilya Gavrichenkov
10/11/2011 | 11:00 PM
There is no doubt that the new AMD processors based on the Bulldozer microarchitecture are among the most highly anticipated products not only of this year, but of at least the last five. There are several reasons for that, as well as for the fact that AMD products have so many fans. I am sure some of you remember the times when AMD processors were better than Intel ones in all aspects. Some users like AMD products for the balanced combination of price and performance that they have to offer. And some may have been carried away by the passion with which AMD engineers talked about the advantages of the new microarchitecture they have been working on. All this, combined with the years of waiting for the new Bulldozer processor generation, produced a pretty logical outcome: you are reading this review with great interest and excitement.
However, the wait was definitely worth it. The success of the Bulldozer processor microarchitecture will determine the situation in the processor market over the next few years. At this time only Intel has sufficient engineering resources and production capacities to roll out new microarchitectures every two-three years. As for AMD, they have to stick to a much more reasonable tempo. It may strike you as somewhat scary, but the microarchitecture currently used in Phenom II and Athlon II processors goes as far back as 2003. There have only been minor “cosmetic” changes made to it since then. Therefore, we do not really expect the launch of Bulldozer processors to speed up the development process for AMD. No doubt that the new Bulldozer will be the basis for all high-performance AMD products for the next few years.
The current version of the company roadmap exploits this microarchitecture up to 2014, but it will most likely continue its existence beyond that point.
AMD’s promise to deliver at least 10-15% performance boost each year is more of a cause for concern rather than optimism. Clock frequency increase will most likely be the primary way of boosting the performance, with microarchitectural improvements being more in the background.
In other words, the success of the Bulldozer microarchitecture today will have a life-changing effect on AMD’s future: it will determine how competitive their products will be and, in the end, what will happen to the processor market in general.
Of course, we can’t deny that Bulldozer is not the only key product for AMD. This microarchitecture is currently positioned for high-performance desktop and server systems. At the same time, AMD has other products for the rest of the market. For example, low-cost energy-efficient processors on Bobcat microarchitecture launched earlier this year or Llano APUs are just as important for the company. And our tests showed that these were pretty successful products that can become an excellent platform for netbooks and nettops as well as power the mainstream integrated platforms.
Nevertheless, Bulldozer’s success or failure will have much greater significance in the long run. First, this microarchitecture targets the market segments with much higher profit margins – servers and high-performance desktops. Therefore, it may have a great impact on AMD’s financial situation. Second, the engineers involved in the design and development of the new microarchitecture have actually nothing to do with the success of AMD’s C, E and A series processors. These CPUs (or APUs in AMD’s terms) owe their success in the market to the integrated Radeon HD graphics cores that got into contemporary AMD processors due to a timely acquisition of ATI Technologies. As for Bulldozer, it is more of a qualification test for the engineering team working specifically on the microarchitecture of the computational cores. And third, Bulldozer will eventually become the basis for the entire AMD CPU lineup except the products designed for energy-efficient platforms. So one day this microarchitecture will also reach the lower-end market segments, completely replacing K10 everywhere, even in Llano processors.
Overall, it is hard to overestimate the importance of a successful launch of Bulldozer microarchitecture. It is a milestone product on the psychological as well as practical level, that is why we really hope to see something like a new K7 or K8.
But even before we get to the actual performance tests we can state that there is very little chance of repeating the above mentioned phenomenon. Last time Intel sort of helped AMD to win the leadership by trying to actively promote a not very ideal NetBurst microarchitecture. Back then Intel engineers bet on growing the clock frequencies, which finally stumbled over gigantic leakage currents, while AMD was offering better-balanced microarchitecture aiming at processing more instructions per clock cycle. But when Intel revised their strategy and launched the new Core microarchitecture, which was also trying to process as many instructions per clock as possible, AMD had to step back to a second place where it has been since then.
It is obviously very difficult to outdo contemporary Intel processors in the number of instructions processed per clock cycle. Today’s Sandy Bridge microarchitecture is the result of at least three rounds of optimizations applied to a design that was efficient right from the start. Therefore, we can’t expect computational cores from AMD to deliver even higher relative efficiency, especially since AMD engineers didn’t even have a goal like that.
Bulldozer has a different primary target. According to the developers, processors based on this microarchitecture should run fast due to high clock frequencies and larger number of computational cores compared with the predecessors and competitors. At the same time, they should remain pretty cost-effective, i.e. have a small semiconductor die and relatively low per-core heat dissipation.
It is quite logical that an increase in the number of processor cores inevitably leads to an increase in the size of the processor die. As a result, the complexity of the manufacturing process and the production costs go up as well. That is why processors with the largest number of cores are currently used only in the server segment: corporate clients are much more willing to pay extra than regular users. AMD’s strategy of increasing the number of processor cores while keeping the price point acceptable implies that they need to simplify the cores accordingly. On the other hand, simplifying the cores may produce some unwanted consequences, such as lower system performance in applications that do not parallelize well, and such applications are still pretty numerous.
Therefore, AMD engineers decided to take a unique alternate way. The microarchitecture of individual cores has become more complex increasing the number of instructions executed per single clock cycle where possible.
But some of the resources that normally exist in each processor core yet have excess capacity are now shared between pairs of cores.
The resulting dual-core units became the primary building blocks for the Bulldozer processors. AMD refers to these units as modules. Each of them has two fully-functional sets of integer execution units. However, the floating point unit, data prefetch units and instructions decoders as well as the L2 cache are shared between the two cores. According to the developers, these components have sufficient potential to feed both cores, because in those cases when there is one such set per core, it is often idling. Moreover, the delays in their stall-free operation do not have any serious effect on the resulting overall performance.
According to AMD, a single module designed as described above can perform at the 80% capacity of a fully-functional dual-core processor. However, they save about 44% of the transistor budget (and consequently of the semiconductor die size).
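The two figures quoted by AMD can be combined into a rough efficiency estimate. This is only a back-of-the-envelope sketch: it assumes that both the 80% and the 44% are measured against the same hypothetical pair of fully independent cores, which AMD's statement does not spell out, and the variable names are ours.

```python
# AMD's quoted figures, both taken relative to a conventional dual-core design
# (that common baseline is an assumption of this sketch).
module_throughput = 0.80    # a module delivers 80% of a full dual-core
transistor_savings = 0.44   # and needs about 44% fewer transistors

# Share of the baseline transistor budget that one module occupies.
module_transistors = 1.0 - transistor_savings

# Throughput obtained per unit of transistor budget, baseline = 1.0.
throughput_per_transistor = module_throughput / module_transistors

print(f"Relative throughput per transistor: {throughput_per_transistor:.2f}x")
```

Under these assumptions a module would extract roughly 1.4 times more multi-threaded throughput from each transistor than two independent cores, which is the whole point of the modular design.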
This inventive approach to increasing the AMD processor cores density allowed the company to design an eight-core (or four-module) Bulldozer semiconductor die.
Moreover, a pretty significant part of the die is allocated for cache memory. L2 processor caches shared between pairs of CPU cores within a single module are 2 MB each, and the L3 cache memory shared across the entire processor is 8 MB big. This way, if we take into account AMD’s traditional exclusive cache design, we can state that the total amount of cache memory reaches 16 MB per eight-core CPU. At the same time, the Bulldozer die size remains within reasonable limits, which means that AMD developers indeed achieved their ultimate goal.
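The 16 MB total follows directly from the numbers above. The short sketch below only reproduces that arithmetic; like the figure in the text, it leaves the L1 caches out, and the constant names are ours.

```python
MODULES = 4            # four dual-core modules on the eight-core die
L2_PER_MODULE_MB = 2   # one L2 cache shared by the two cores of a module
L3_MB = 8              # one L3 cache shared by the entire processor

# With an exclusive hierarchy the cache levels do not hold duplicate data,
# so their capacities can simply be added up.
total_cache_mb = MODULES * L2_PER_MODULE_MB + L3_MB

print(f"L2 total: {MODULES * L2_PER_MODULE_MB} MB, L2 + L3: {total_cache_mb} MB")
```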
If we talk absolute numbers, it means that eight-core Bulldozer processors will have a smaller semiconductor die than, say, six-core Thuban (Phenom II X6) CPUs with K10 microarchitecture inside. However, it is important to keep in mind that Bulldozer processor will be manufactured using more advanced 32 nm production process. And compared with contemporary quad-core Intel Sandy Bridge processors, the new eight-core AMD CPUs will have only 45% larger die.
However, due to Hyper-Threading technology, the operating system may also see quad-core Intel Sandy Bridge processors as eight-core ones, just like the Bulldozer. This fact may pose a question about how appropriate it is to actually call Bulldozer a fully-fledged eight-core processor. However, it is important to understand that AMD and Intel took different approaches to implementing simultaneous execution of eight computational threads. Intel developers armed their microarchitecture with the ability to execute two computational threads within a single core using only one set of execution units. AMD, on the contrary, removed all “unnecessary” extras from two fully functional cores, but kept two sets of execution units inside each module.
As a result, Intel’s Hyper-Threading technology increases multi-threaded performance only by about 15-20%, while AMD’s solution produced an 80% boost on transition from 4 to 8 threads.
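To put these scaling figures side by side, here is a minimal sketch. The function name is hypothetical, and the result is expressed in each vendor's own "one thread running alone" units, so it says nothing about how a Bulldozer core compares with a Sandy Bridge core in absolute terms.

```python
def effective_threads(physical_units, second_thread_gain):
    """Throughput in units of 'one thread running alone on one core/module'.

    physical_units     -- number of Intel cores or AMD modules
    second_thread_gain -- extra throughput from the second thread (0.8 = 80%)
    """
    return physical_units * (1.0 + second_thread_gain)


# Quad-core Sandy Bridge with Hyper-Threading: 15-20% gain from eight threads.
sandy_bridge_low = effective_threads(4, 0.15)
sandy_bridge_high = effective_threads(4, 0.20)

# Four-module Bulldozer: 80% gain when going from 4 to 8 threads.
bulldozer = effective_threads(4, 0.80)

print(f"Sandy Bridge: {sandy_bridge_low:.1f}-{sandy_bridge_high:.1f} single-thread equivalents")
print(f"Bulldozer:    {bulldozer:.1f} single-thread equivalents")
```

The comparison shows only how well each design scales with the second thread, not which processor is faster.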
Although I have to admit that the semiconductor die of the eight-core Bulldozer processor looks more like a four-core one because of its modular internal structure.
Obviously, increasing the number of cores is not going to be an ultimate panacea. That became clear back when AMD launched their Phenom II X6 processors, which were inferior to quad-core Sandy Bridge in performance. Therefore, AMD developers didn’t stop at extensive design modifications. The basic Bulldozer microarchitecture has been changed practically completely compared with K10, so there is hope that AMD systems will speed up not only in multi-threaded tasks, but also in less-parallel applications. And these hopes are backed up by some very objective evidence. While previous AMD microarchitectures were designed for processing up to three instructions per clock (on a single core), the Bulldozer microarchitecture should be capable of processing four instructions per clock, which brings it very close to the competing processors on the Core microarchitecture.
We can see some quality changes at the very first stages of the execution pipeline: during instruction prefetch and decoding. These stages are shared between the pairs of cores within a single module, so AMD made sure that they didn’t turn into an architectural bottleneck. The instructions to be decoded are prefetched from L1I cache in 32-Byte blocks – twice as much as in second-generation Core based processors. The actual L1I cache for instructions is 64 KB big and has two-way associativity. The instructions to be decoded are preloaded speculatively into this cache from L2 cache memory.
Instructions are prefetched by the branch prediction unit containing two sets of buffers, which independently monitor activity of different cores. This way Bulldozer doesn’t “get lost” in the threads during branch prediction. Since the new microarchitecture is designed to work at high clock frequency, the quality of branch predictions is extremely important. Therefore, AMD have completely changed the branch prediction algorithms and now they hope that the branch prediction accuracy in the new Bulldozer will improve substantially.
Bulldozer’s x86 instruction decoder is also shared between the two cores and is capable of decoding up to four incoming instructions per clock cycle. However, its performance is limited by four macro-instructions (which are the result of the decoding process, in AMD’s terms), while x86 instructions may disintegrate into 1-2 or even more macro-instructions. So, even though the decoder has become 33% more effective compared with the previous generation microarchitecture, its performance may not be high enough to load optimally two integer clusters and one floating-point cluster.
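The decoder arithmetic can be sketched as follows. The 33% figure comes from widening the decoder from three to four instructions per clock; the even split between two active cores is a simplifying assumption of ours, not a documented arbitration policy.

```python
K10_DECODE_WIDTH = 3        # instructions per clock, per core (previous generation)
BULLDOZER_DECODE_WIDTH = 4  # instructions per clock, per module (shared by two cores)

# Relative improvement of the decoder itself.
decoder_gain = (BULLDOZER_DECODE_WIDTH - K10_DECODE_WIDTH) / K10_DECODE_WIDTH


def per_core_decode_share(active_cores_in_module):
    """Average decode slots per clock for each active core.

    Assumes the shared decoder is split evenly between active cores,
    which is a simplification made for this illustration.
    """
    if active_cores_in_module not in (1, 2):
        raise ValueError("a Bulldozer module has one or two active cores")
    return BULLDOZER_DECODE_WIDTH / active_cores_in_module


print(f"Decoder gain over K10: {decoder_gain:.0%}")
print(f"One active core:  {per_core_decode_share(1):.0f} slots per clock")
print(f"Two active cores: {per_core_decode_share(2):.0f} slots per clock each")
```

With both cores busy, each one gets on average fewer decode slots than a K10 core had to itself, which is why the shared decoder may turn into a bottleneck.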
I have to say that they also used some kind of a macro-fusion technology in the new Bulldozer. Certain groups of x86 instructions may join together and go through the decoder as a single instruction – AMD calls it Branch Fusion.
Decoded macro-instructions are then distributed to three computational clusters: two of them are what remains of the fully-fledged computational cores, and the third one, the floating-point cluster, is shared by the cores. Each of these clusters has its own instruction reordering logic and its own scheduler. It obviously means that AMD can eventually fully replace or modify some of these clusters in their future products.
Instructions reordering process in each cluster is based on a physical register file. This file stores links to register contents and eliminates the need to constantly move data around inside the processor once instructions order changes. This approach replaced reorder buffer, because a physical register file is more energy-efficient and more tolerant to processor clock frequency increase.
Each integer cluster contains two arithmetic logic units (ALU) and two address generation units (AGU) for working with memory addresses. Compared with the K10 microarchitecture, that is one ALU and one AGU fewer, but according to AMD, it shouldn’t severely affect the performance, while it allows a significant reduction in die size. I can easily believe that: it doesn’t make much practical sense to have more than two ALUs and two AGUs per integer cluster, because the decoder can send no more than four macro-instructions per clock for execution by both clusters.
At the same time, the execution units have become more universal and barely differ in functionality.
The organization of the cache-memory system has changed dramatically. They lowered the size of L1D cache from 64 KB to 16 KB and made it inclusive with write-through. At the same time they increased its associativity to 4-way and added a “way predictor”. In order to make up for the serious reduction in size, they increased the bandwidth of the L1 data cache quite substantially, so that now it can process up to three 128-bit operations at the same time – two reads and one write.
The changes in the L1D cache bandwidth are obviously connected with the need to implement 256-bit AVX instructions, which are now supported in the shared FPU unit. However, it doesn’t mean that floating-point units have now become 256-bit. In reality there are four 128-bit units in a single Bulldozer module and AVX instructions are decoded as connected pairs of 128-bit operations. Therefore, floating-point multiply-accumulate (FMAC) blocks unite to execute them and the performance of the floating-point cluster drops to one AVX-instruction per processor module per clock cycle.
The FPU doesn’t have its own L1 cache, which is why this cluster works with data via the integer units.
Since AMD engineers decided to implement support for Intel’s AVX instructions in their Bulldozer processors, they also added support for other current instruction sets, such as SSE4.2 and AES-NI for encryption acceleration. Moreover, AMD also introduced a few instructions of their own: fused multiply-add with three source operands and a separate destination (FMA4) and their own unique vision of future AVX development – XOP.
Bulldozer’s L2 cache exists as a single unit inside the processor module and is shared by the cores. It is of impressive 2 MB size and has 16-way associativity. However, the latency of a cache like that increased to 18-20 clocks, while the bus width remained the same as before – 128 bit. It means that even though Bulldozer’s L2 cache is large, it is not particularly fast: the current competitors and predecessors have L2 caches with about half the latency. I have to say that together with a small L1D cache with 4-clock latency (which is also higher than in K10 microarchitecture) it doesn’t look too good. However, AMD insists that they increased the latency of their cache memory only to ensure that Bulldozer would be capable of running at higher clock speeds.
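The cache parameters quoted so far can be turned into a set count for each cache. This is a sketch under one explicit assumption: the text does not state the cache line size, so a 64-byte line is assumed here. The function and dictionary names are ours.

```python
KB = 1024
MB = 1024 * KB
LINE_SIZE = 64  # bytes; ASSUMPTION, the line size is not given in the text


def num_sets(size_bytes, ways, line_size=LINE_SIZE):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_size)


# Sizes and associativity as quoted in the text.
caches = {
    "L1I (per module)": (64 * KB, 2),
    "L1D (per core)": (16 * KB, 4),
    "L2 (per module)": (2 * MB, 16),
}

for name, (size, ways) in caches.items():
    print(f"{name}: {ways}-way, {num_sets(size, ways)} sets")
```

A small L1D with few sets is quick to index, which fits AMD's stated goal of keeping the design friendly to high clock speeds.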
Moreover, AMD engineers implemented an efficient data prefetch unit capable of loading speculative data into the L1 and L2 caches. These units are claimed to be working much more effectively now and should be capable of recognizing irregular data structures.
Theoretically, Bulldozer looks very attractive. AMD have completely revised their old vision of processor microarchitecture and came up with a totally new design. And I have to say that this new design looks highly promising, because the new microarchitecture has been optimized for processing four instructions per clock instead of three in one processor core. Besides, it also supports macro-fusion of instructions before the decoding stage, which increases the effective performance even more.
But everything looks picture perfect only when we look at one core and do not take into consideration the fact that in reality such cores are combined into pairs. And the dual-core Bulldozer module has too many units shared between the cores. In particular, since a module like that has only one instruction prefetch unit and one decoder, the entire dual-core block can still process only four instructions per clock. It means that in terms of theoretical performance it is the Bulldozer module, and not the individual core, that should be considered a logical equivalent of a single core in Sandy Bridge processors. In this case the module’s ability to execute two threads looks like a pretty logical response from AMD to Intel’s Hyper-Threading technology.
Of course, our performance tests of the new processors will dot all i’s, but even at this point we can’t help thinking that Bulldozer’s positioning as an eight-core processor is more of a marketing move. In reality it is the number of modules that gives us a better idea of these processors’ computational potential. In respect to theoretical performance it seems more logical to compare these modules to cores in the terms of second-generation Intel Core microarchitecture.
Therefore, a logical question pops up: why did AMD decide to implement two-thread processing within a single processor module? Why couldn’t they simply combine the execution units in two different cores into a single cluster? There are several reasons for that.
First, they would need advanced internal logic to be able to load numerous execution units optimally. AMD, however, didn’t succeed with the implementation of highly efficient branch prediction and instruction and data prefetch. Therefore, it is the responsibility of software developers to deliver Bulldozer-compatible multi-threaded applications that are well parallelized and use the execution units optimally.
Second, larger number of simultaneously processed computational threads is a good thing. While desktop users and especially gamers will hardly benefit that greatly from eight fairly simple Bulldozer cores, this microarchitecture should be highly welcome in the server environment. So, it is quite possible that the primary goal for Bulldozer was regaining AMD’s leadership in the server market rather than making computer enthusiasts happy.
Energy efficiency is one of the most important characteristics of contemporary processors. Intel, for example, puts lowering the power consumption of their upcoming microarchitectures at the top of their list of objectives. AMD hasn’t got there yet, and their engineers are still primarily chasing higher speeds. But it doesn’t mean that the developers didn’t pay due attention to Bulldozer’s thermal and power characteristics. On the contrary, following Llano, fundamentally new approaches to energy efficiency found their way into the Bulldozer processors. However, in this case the developers used the freed potential not so much for energy savings as for increasing the clock frequencies and thus improving the performance even more.
Of course, the finer production process did have some positive effect on the power consumption and heat dissipation readings. Bulldozer is manufactured with high-K dielectric 32 nm process, metal gate transistors and SOI technology. In other words, it is the same GlobalFoundries process that is used for Llano manufacturing. As a result the mass production eight-core Bulldozer processors maintain 1.4 V maximum core voltage.
However, the major innovation inherited from Llano is the use of power gating, which disconnects the power from selected parts of the CPU. It allows shutting down power to selected dual-core modules and cache memory in Bulldozer processors.
When both computational cores within one module switch to C6 power-saving mode, the module power turns off. Unfortunately, this technology cannot apply to processor cores, because there are simply no individual cores inside Bulldozer – they share some of their resources with the other cores within the same modules.
C6 power-saving modes also control the Turbo Core technology in Bulldozer processors. When at least half of Bulldozer processor modules are off and in power-saving mode, its core voltage and clock frequencies increase. This forced mode is called Max Turbo Boost.
However, there is nothing new in Max Turbo Boost mode, as AMD introduced the same automatic overclocking back in their Thuban processors on K10 microarchitecture. The principally new thing here is the All Core Boost mode, when the clock frequency may increase beyond its nominal value even when all processor cores are active. The enhanced version of Turbo Core implemented in Bulldozer processors allows them to accurately assess their actual power consumption and heat dissipation judging by the utilization level of different units. So, if according to the processor’s estimate the current power consumption and heat dissipation are well below the threshold values, the processor can increase its core voltage and clock speed even if none of the cores are in idle mode.
So, the clock frequency of Bulldozer based processors is an extremely variable value. It may change dramatically in a very large interval (up to 900 MHz) depending on the “heaviness” of the executed algorithms and on the number of active cores.
With the launch of the new microarchitecture AMD not only kept the overall platform design, but even maintained compatibility of the new Bulldozer processors with the existing infrastructure. As a result, just like their predecessors, the new processors contain an integrated North Bridge with the L3 cache, memory controller and HyperTransport bus controller. At the same time, although all other recently launched AMD and Intel processors also have an integrated PCI Express graphics bus controller, the new Bulldozer doesn’t have one.
Just like processors based on K10 microarchitecture, the North Bridge in Bulldozer processors works at its own clock frequency, which is set at 2.0-2.2 GHz for different CPU models. Note that this frequency does have some effect on performance, because it directly affects the speed of L3 cache. And as we have already said, the new processors have an 8 MB L3 cache with 64-way associativity. Per special request from the corporate users, the data stored in this cache-memory is protected with error correction code (ECC).
The memory controller in the Bulldozer processors doesn’t boast anything fundamentally new. Just as before, it supports DDR3 SDRAM, uses a dual-channel design and in fact consists of two independent single-channel controllers that may work as a pair or independently. The only things AMD added here are support for faster memory types, such as DDR3-1866, and compatibility with energy-efficient memory modules working at 1.25 V and 1.35 V.
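For reference, the theoretical peak bandwidth of such a dual-channel controller is easy to estimate. The sketch below relies on the standard 64-bit (8-byte) width of a DDR3 channel and takes the fastest supported grade as a nominal 1866 MT/s; the function name is ours, and real-world throughput will always be lower than this ceiling.

```python
def peak_bandwidth_gbs(transfers_mts, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_mts      -- data rate in mega-transfers per second (1600 for DDR3-1600)
    channels           -- number of independent 64-bit channels
    bytes_per_transfer -- 8 bytes for a 64-bit DDR3 channel
    """
    return transfers_mts * channels * bytes_per_transfer / 1000


for rate in (1333, 1600, 1866):
    print(f"{rate} MT/s, dual-channel: {peak_bandwidth_gbs(rate):.1f} GB/s")
```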
Speaking of the desktop Bulldozer modification codenamed Zambezi, we should mention that it is designed for the new Socket AM3+ platform also known as Scorpius. Socket AM3+ has 942 pins, which is 1 pin more than Socket AM3 has. However, despite the pin difference the new Zambezi will be compatible with the old Socket AM3 mainboards, too. If you use a new processor with the old mainboard, you will only lose some selected power management functions. For example, the frequencies will switch slower with active Turbo Core and Cool’n’Quiet and Vdrop will not work at all.
Nevertheless, AMD worked closely with all mainboard manufacturers to make sure that by the time Zambezi launches there will be numerous new products available based on the new chipsets from the 900-series. The flow-chart below shows a typical system built around Zambezi processor and the new chipset:
The distinguishing feature of the new AMD 990FX (and its simpler modifications – AMD 990X and AMD 970) is basically just the support for the specific electrical requirements of the new Socket AM3+. There are no new interfaces of any kind. Just like the 800-series chipsets, the new South Bridge supports six SATA 6 Gbps ports and fourteen USB 2.0 ports. Even though we were dying to see such things as PCI Express 3.0 or at least USB 3.0 support in the new chipsets, there is nothing like that. It is actually pretty strange because the chipsets for the lower-end Socket FM1 platform did acquire USB 3.0 support.
The only differences between the new chipset modifications are the types of supported multi-GPU configurations:
The launch of Zambezi processors completes AMD’s processor line-up update. Desktop CPUs based on the new Bulldozer microarchitecture will be the new flagship product, which will quickly oust from the market all Phenom II models.
In order to stress the innovative nature of their new microarchitecture, AMD will use a different marketing name for their Zambezi processors – FX. On the one hand, it fits perfectly into the new naming system that implies the use of letters for CPU marking, but on the other hand, it is reminiscent of the legendary Athlon 64 FX processors, which were the fastest desktop CPUs 6-7 years ago. However, those times are long gone, so let’s take a closer look at what we are being offered today.
There will very soon be four FX processor models available in the market:
Although Zambezi processor models differ not only by the clock speeds, but also by the number of active computational cores, they will all be built from the same unified semiconductor die. Here it is:
In order to build processors with fewer cores than eight, AMD will disable some of them on the semiconductor die. It is still a question, whether they can be unlocked the same way we did with processors on K10 microarchitecture. Nevertheless, we saw all the corresponding options in the BIOS Setup of several mainboards built around the new 900-series chipsets, so there is definitely hope for the positive outcome.
The production of six-core and quad-core processors will imply the per-module core locking. It means that they will lock the entire dual-core module rather than a second core in two modules like that, although the latter approach could be much more efficient from the performance perspective. However, six- and quad-core Bulldozer processors are merely the way to utilize the defective dies, which may be quite numerous since they are going to use the new production process and the die is pretty large in size.
Although AMD optimized the new microarchitecture for operation at high clock speeds, we can’t say that they have reached any impressive break-through. The 4 GHz threshold is still unreached and the nominal frequency of the top FX processor is even lower than that of Phenom II X4 980. We hope that as they master the production process, Zambezi frequencies will continue to grow rapidly. Although according to the current AMD roadmap, the new processor family should start speeding up no sooner than in Q1 2012.
We don’t see any dramatic victories in terms of power consumption and heat dissipation either. AMD have been promising us for a long time that the new Bulldozer would be more energy-efficient than predecessors, but in reality the top eight-core models have the same TDP as the top Phenom II CPUs. Although very soon they should add a 95 W FX-8120 model as well as an FX-8100 with the same TDP to their lineup.
On the other hand, the prices of the new FX processors seem to be more than attractive. AMD doesn’t want to deviate from their plan to continue offering platforms at a lower price than competition, that is why the top eight-core Zambezi processors are positioned against the top Core i5 CPUs. Overall, AMD is going to stick to the following positioning plan:
In other words, AMD has no intention whatsoever to compete against six-core Intel CPUs and the upcoming LGA2011 and intends to focus on the mainstream segment.
Great news for enthusiasts is that all FX processors will come with unlocked multipliers. All Zambezi CPUs can easily be overclocked not only by simply adjusting the base clock multiplier, but also by reconfiguring their Turbo Core technology. You can also overclock the memory sub-system and the frequency of the North Bridge integrated into the processor.
AMD gave us the opportunity to check out the new Zambezi processor – FX-8150.
Its nominal clock speed is 3.6 GHz and you can find out more about it from the following CPU-Z screenshot:
Note that it uses the B2 processor stepping, which is far from the first version. The previous revisions of the semiconductor die didn’t make it because they refused to work at the originally planned clock speeds. This is actually why the spring launch was pushed back to summer and then to fall, and finally took place in the middle of October.
However, today’s frequency of 3.6 GHz doesn’t look very impressive. Both AMD and Intel have products working at higher frequencies. However, FX-8150 supports the very promising Turbo Core technology, which is capable of automatically increasing the CPU clock frequency to 4.2 GHz under low load.
It is remarkable that a 3.9 GHz frequency may be reached even when all the processor cores are working, as long as there is sufficient margin for automatic overclocking without exceeding the power consumption and heat dissipation limits.
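The Turbo Core behavior of FX-8150 described above can be summarized as a small decision function. This is a simplified model built only from the figures in the text (3.6, 3.9 and 4.2 GHz, and the "at least half of the modules in C6" condition). Treating power and thermal headroom as a single yes/no flag, and requiring it for both turbo modes, are our assumptions; the real logic is certainly more granular.

```python
BASE_GHZ = 3.6            # nominal FX-8150 frequency
ALL_CORE_BOOST_GHZ = 3.9  # All Core Boost: every core may be active
MAX_TURBO_GHZ = 4.2       # Max Turbo Boost: at least half of the modules in C6
TOTAL_MODULES = 4


def target_frequency(modules_in_c6, has_power_headroom):
    """Pick the target clock for the active cores (simplified model).

    modules_in_c6      -- number of dual-core modules currently power-gated
    has_power_headroom -- True if estimated power and heat are below the limits
                          (ASSUMPTION: modeled as one flag for both turbo modes)
    """
    if not 0 <= modules_in_c6 <= TOTAL_MODULES:
        raise ValueError("modules_in_c6 must be between 0 and 4")
    if not has_power_headroom:
        return BASE_GHZ
    if modules_in_c6 * 2 >= TOTAL_MODULES:
        return MAX_TURBO_GHZ
    return ALL_CORE_BOOST_GHZ


print(target_frequency(0, True))   # all modules busy, headroom available
print(target_frequency(2, True))   # half of the modules asleep
print(target_frequency(0, False))  # no headroom left
```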
In idle mode Cool’n’Quiet technology lowers the clock frequency of FX-8150 processor to 1.4 GHz. The Vcore in this case drops to 0.85 V.
We are going to compare the new eight-core AMD FX-8150 processor on Bulldozer microarchitecture against one of its predecessors – six-core Phenom II X6 as well as against competitors from Intel – quad-core Core i5-2500 and Core i7-2600. Moreover, we also added the performance numbers for the six-core Core i7-990X CPUs.
As a result, our testbeds were built using the following hardware and software components:
Note that we ran all tests under the current Windows 7 version, but AMD indicates that the task scheduler of this OS doesn’t distribute the computational threads in the optimal way. Windows 7 prefers to direct threads to cores inside different modules first. And it does in fact deliver the highest relative performance, because it reduces the load on the shared units inside each module. However, this strategy prevents the use of turbo modes, which could kick in if some of the dual-core processor modules were in power-saving mode.
The upcoming Windows 8 OS will work differently assigning the computational threads to cores within the same module first. As a result, AMD promises that Zambezi performance may increase by as much as 10% in some selected applications.
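The difference between the two placement strategies is easy to illustrate. The sketch below is a toy model, not the actual Windows scheduler: the function names and the "spread" and "pack" policy labels are ours. It only counts how many threads land in each module and how many modules stay idle and could therefore be power-gated.

```python
def place_threads(n_threads, n_modules=4, policy="spread"):
    """Return the number of threads assigned to each dual-core module.

    policy="spread" -- one thread per module first (Windows 7 behavior as described)
    policy="pack"   -- fill both cores of a module before using the next one
                       (the behavior described for Windows 8)
    """
    if not 0 <= n_threads <= 2 * n_modules:
        raise ValueError("thread count must fit the available cores")
    load = [0] * n_modules
    for i in range(n_threads):
        if policy == "spread":
            module = i % n_modules
        elif policy == "pack":
            module = i // 2
        else:
            raise ValueError(f"unknown policy: {policy}")
        load[module] += 1
    return load


def idle_modules(load):
    """Modules with no threads, i.e. candidates for the C6 state."""
    return sum(1 for threads in load if threads == 0)


for policy in ("spread", "pack"):
    load = place_threads(4, policy=policy)
    print(f"{policy}: load per module {load}, idle modules: {idle_modules(load)}")
```

With four threads, spreading keeps every module awake, while packing leaves two modules idle, which is exactly the condition for the highest turbo frequency.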
Before we get to the actual benchmarking part, we decided to try and predict what we could expect the new Bulldozer microarchitecture to be capable of in general. To accomplish this, we compared the new processor against other CPUs on K10 and Sandy Bridge microarchitectures in synthetically created identical environments: at the same clock frequency and with the same number of active cores.
To be more exact we compared AMD FX-8150, Phenom II X6 1100T and Core i7-2600 at 3.6 GHz frequency and with only two active computational cores. To ensure the purity of the experiment we disabled all power-saving and auto-overclocking technologies. We used a set of simple synthetic benchmarks in SiSoft Sandra 2011 suite, where we manually disabled all instructions beyond SSE3, because K10 microarchitecture doesn’t support them.
The numbers in this table speak louder than words. The performance of Bulldozer microarchitecture is way lower than that of the previous-generation processors. The simplification of the microarchitecture, which combines a pair of cores into a single module with shared resources, led to a significant (25-40%) drop in specific performance compared with the previous-generation AMD microarchitecture. As a result, Bulldozer cores do not just work at half the speed of Sandy Bridge cores: the performance of a Bulldozer module with two cores is even lower than that of a single Sandy Bridge core with Hyper-Threading enabled. Should we expect any performance records from a CPU with such microarchitecture? This is more of a rhetorical question…
At the same time let’s take a look at the practical characteristics of the caches and memory sub-system. To estimate the performance of these functional units we resorted to Cachemem utility from Aida64 suite. We used DDR3-1600 SDRAM with 9-9-9-27-1T timings. Just as in the previous case, the processors all worked at 3.6 GHz clock frequency.
As we can see, the practical latencies of all caches and the memory sub-system in Zambezi processors increased. We have already discussed this in the chapter devoted to Bulldozer microarchitecture. However, the memory bandwidth increased in almost all cases due to modifications of the internal cache-memory organization.
At the same time, the fastest dual-channel memory controller and the fastest cache-memory sub-system are the ones in Sandy Bridge, although in terms of cache size the new Bulldozer is superior.
As usual, we use Bapco SYSmark 2012 suite to estimate the processor performance in general-purpose tasks. It emulates the usage models in popular office and digital content creation and processing applications. The idea behind this test is fairly simple: it produces a single score characterizing the average computer performance.
As you probably remember, a little while back AMD tried to discredit SYSmark, stating that it wasn’t an objective benchmark because of the “unfair” combination of applications it used. However, in our opinion, this complaint is unjustified, because the performance is estimated using widespread and really popular programs. The contribution of each such program to the final test score is given on the following diagram:
Therefore, we decided not to give up SYSmark 2012 and continue using this suite to estimate the performance in general-purpose applications.
The first test turned out to be a big disappointment. The eight-core FX-8150 processor is only 10% faster than the six-core Phenom II X6 1100T and, of course, it is way behind the quad-core Intel CPUs. So, it looks like AMD’s decision to build a processor with a lot of cores featuring low specific performance instead of using a moderate number of complex cores doesn’t work as well as they expected it to.
Let’s take a closer look at the performance scores SYSmark 2012 generates in different usage scenarios.
Office Productivity scenario emulates typical office tasks, such as text editing, spreadsheet processing, email and Internet surfing. This scenario uses the following applications: ABBYY FineReader Pro 10.0, Adobe Acrobat Pro 9, Adobe Flash Player 10.1, Microsoft Excel 2010, Microsoft Internet Explorer 9, Microsoft Outlook 2010, Microsoft PowerPoint 2010, Microsoft Word 2010 and WinZip Pro 14.5.
Media Creation scenario emulates the creation of a video clip using previously taken digital images and videos. It uses popular Adobe suites: Photoshop CS5 Extended, Premiere Pro CS5 and After Effects CS5.
Web Development is a scenario emulating web-site designing. It uses the following applications: Adobe Photoshop CS5 Extended, Adobe Premiere Pro CS5, Adobe Dreamweaver CS5, Mozilla Firefox 3.6.8 and Microsoft Internet Explorer 9.
Data/Financial Analysis scenario is devoted to statistical analysis and prediction of market trends performed in Microsoft Excel 2010.
3D Modeling scenario is fully dedicated to 3D objects and rendering of static and dynamic scenes using Adobe Photoshop CS5 Extended, Autodesk 3ds Max 2011, Autodesk AutoCAD 2011 and Google SketchUp Pro 8.
The last scenario called System Management creates backups and installs software and updates. It involves several different versions of Mozilla Firefox Installer and WinZip Pro 14.5.
The Bulldozer-based processor demonstrates different results in different usage models. In some cases it runs even slower than Phenom II X6, but there are also a few opposite situations. Overall, the general rule can be defined as follows: FX-8150 is particularly efficient in applications with a multi-threaded, well-parallelized load that at the same time is not computationally challenging.
However, even in the most favorable situations FX-8150 falls behind Core i5-2500. The only scenario where these processors demonstrate comparable speed is 3D rendering. Other than that the Intel product is on average 25% faster, which is kind of sad…
As you know, in the majority of contemporary games it is the graphics subsystem that determines the performance of an entire platform equipped with a reasonably fast processor. Therefore, we do our best to make sure that the graphics card is not loaded too heavily during the test session: we select the most CPU-dependent tests, and all tests are performed without antialiasing and at resolutions well below the maximum. In other words, the obtained results show not so much the fps rate that can be achieved in systems equipped with contemporary graphics accelerators, but rather how well contemporary processors cope with gaming workloads. Therefore, the results help us determine how the tested CPUs will behave in the near future, when new faster graphics card models become widely available.
Games are not among those tasks that create parallel multi-threaded loads. Therefore, quad-core processors suit gamers’ needs much better than AMD’s multi-core monsters. The diagrams above are a great example of that. The new eight-core FX-8150 is not any faster than its six-core predecessor – Phenom II X6.
As for how Zambezi compares with Sandy Bridge in games, things are far from optimistic for AMD. The current Intel microarchitecture copes much better with the typical workload created by 3D games and there is absolutely no hope that AMD will ever manage to catch up with the competition here. In other words, the only time it makes sense to use Bulldozer for gaming is when you are absolutely sure that the given processor will be fast enough with your specific graphics sub-system and in your specific games. However, even in this case it is important to understand that the next graphics card upgrade may actually have an adverse effect, and you will be in a worse situation than those users who initially preferred an Intel platform.
In addition to our gaming tests we would like to offer you the results of the synthetic Futuremark 3DMark11 test run with the Extreme settings profile.
We added these results in order to show the ideal situation for FX-8150, namely when the video sub-system doesn’t actually allow the processor to show its full potential. Here the graphics card is loaded to the fullest and the CPU performs an auxiliary function. In this case we can state that Bulldozer and Sandy Bridge processors are equally fast, although this is not exactly true.
However, the new FX-8150 looks quite good in the 3DMark11 Physics test (especially against the background of the previous results). The new eight-core AMD processor performs comparably with the quad-core Intel Core i5-2500 during the multi-threaded calculation of the gaming physics model.
I have to say that the general and gaming performance of the new desktop Bulldozer turned out lower than we expected. However, we are not giving up and are ready to look for situations where new AMD microarchitecture will really shine.
To test the processor performance during data archiving we resort to the WinRAR archiving utility. Using the maximum compression rate, we archive a folder with multiple files 1.4 GB in total size.
FX-8150 performance turns out close to that of Core i5-2500. WinRAR is not one of those applications that can split the load into eight parallel threads for all eight Bulldozer cores, but the gigantic cache memory seems to save the situation here.
The second similar test of the archiving speed is performed in 7-zip that uses LZMA2 compression algorithm.
FX-8150 does really great in 7-zip. This eight-core processor gets very close to the quad-core Core i7-2600 with enabled Hyper-Threading, which can also execute eight threads at the same time, just like the new Bulldozer.
The processor performance during encryption is measured with an integrated benchmark from a popular cryptographic utility called TrueCrypt. I have to say that it can not only effectively utilize any number of processor cores, but also supports special AES instructions.
Well-parallelized simple integer algorithms are exactly what Bulldozer microarchitecture needs. As we can see, the performance may be pretty impressive in this case. Namely, the only processor FX-8150 couldn’t outperform was the six-core Core i7-990X. As for the LGA1155 processors, our hero was way ahead of all of them.
We use Apple iTunes utility to test audio transcoding speed. It transcodes the contents of a CD disk into AAC format. Note that the typical peculiarity of this utility is its ability to utilize only a pair of processor cores.
Applications generating few computational threads are not a good match for Bulldozer. Individual cores of this processor are too weak to perform well here.
We measured the performance in Adobe Photoshop using our own benchmark made from Retouch Artists Photoshop Speed Test that has been creatively modified. It includes typical editing of four 10-megapixel images from a digital photo camera.
In Photoshop FX-8150 doesn’t perform as poorly as K10 based processors, but it is still unable to catch up with Core i5-2500. In this case large cache memory helps Bulldozer microarchitecture a lot, but it is not enough to guarantee victory. The efficiency and specific performance of the computational cores are still the primary factor.
We have also performed some tests in Adobe Photoshop Lightroom 3 program. The test scenario includes post-processing and export into JPEG format of a hundred 12-megapixel images in RAW format.
Lightroom is capable of splitting the photo processing between any number of cores, which is why the eight-core FX-8150 does pretty well here. Although I have to admit that “pretty well” is a very relative term in this case, as its performance is only comparable with that of the Core i5-2500. This means that two Bulldozer cores are equivalent to one Sandy Bridge core without Hyper-Threading.
The performance in Adobe Premiere Pro is determined by the time it takes to render a Blu-ray project with a HDV 1080p25 video into H.264 format and apply different special effects to it.
Previous-generation AMD processors coped pretty well with video transcoding. Bulldozer microarchitecture does even better in this type of application, which is why FX-8150 performs even faster than Core i5-2500.
We estimated the video editing speed in Adobe After Effects by measuring the time it took to apply a combination of filters and special effects such as blur, bulge, color key, frame blending, glow, motion blurring, fading, 2D and 3D manipulation, shadows, echo, median, radial blur, invert, etc.
Although this is a well-parallelized type of load, FX-8150 falls behind Intel competitors in After Effects.
In order to measure how fast our testing participants can transcode a video into H.264 format we used x264 HD benchmark. It works with an original MPEG-2 video recorded in 720p resolution with 4 Mbps bitrate. I have to say that the results of this test are of great practical value, because the x264 codec is also part of numerous popular transcoding utilities, such as HandBrake, MeGUI, VirtualDub, etc.
AMD processors have always performed well during x264 video transcoding tests. Now that their eight-core microarchitecture is out, the results have improved even more. FX-8150 outperforms even Core i7-2600 during the second, more resource-hungry pass. So, finally, we found a second application, besides TrueCrypt, where processors on Bulldozer microarchitecture do absolutely great.
Rendering speed in Autodesk 3ds max 2011 was measured using a special SPECapc test. Starting with this review we are going to use a new professional version of SPECapc for 3ds Max 2011.
Rendering is also a task that is well optimized for multi-core microarchitectures. However, despite this fact FX-8150 still runs slower than Core i5-2500 and Core i7-2600, not to mention Core i7-990X. On the other hand, the new AMD processor doesn’t lose to its predecessor, so things aren’t bad after all.
Summing up all obtained results in individual applications we can conclude that in our tests the new FX-8150 was about 14% faster than Phenom II X6 1100T. As a result, it was not any slower than Core i5-2500 in almost half of all tests. However, the lag behind the next Intel model, Core i7-2600, still remains pretty serious and exceeds 10%.
Although we managed to find a set of applications where Bulldozer performance is fairly good, the CPUs based on this new microarchitecture are far from being considered revolutionary. Our only hope at this point is the power consumption, because previous AMD processors were way behind their competitors in this aspect. Now, however, the new microarchitecture is promised to be much more energy-efficient. Plus, the new finer 32 nm process should have contributed to the improvement of the electrical characteristics of the new processors. So, let’s check out the performance-per-watt of the new FX-8150.
The graphs below show the full power draw of the computer (without the monitor) measured after the power supply. It is the total of the power consumption of all the system components. The PSU's efficiency is not taken into account. The CPUs are loaded by running the 64-bit LinX 0.6.4 utility. We enabled all the power-saving technologies for a correct measurement of the computer's power draw in idle mode: C1E, C6, AMD Cool'n'Quiet and Enhanced Intel SpeedStep.
In idle mode systems with Bulldozer based processors consume less power than similar systems with Phenom II CPUs. However, contemporary LGA1155 systems from Intel still consume the least power of all.
In case of single-threaded load the power consumption of the Socket AM3+ system rapidly increases, which most likely happens because of the highly aggressive Turbo Core technology. Intel-based systems do not demonstrate anything like that and they can again boast much better energy-efficiency.
In case of heavy multi-threaded load things do not really change much. The only difference is that the LGA1366 system with Core i7-990X inside dashed forward. Otherwise, things are exactly the same. FX-8150 can’t boast any specific power-saving success. It does consume a little less than Phenom II X6 1100T, but Intel Sandy Bridge processors are still at least 1.5 times more energy-efficient.
AMD used all the energy-efficiency they gained from the new microarchitecture to increase the clock speeds. And in the end there is no fundamentally significant improvement in either energy-efficiency or performance. Therefore, in the performance-per-watt aspect the new Bulldozer, just like its predecessors, is still seriously behind the competing Intel microarchitectures.
For your reference here are the power consumption readings from the isolated CPU and mainboard power rails:
The “pure” power consumption of the eight-core FX-8150 is about twice as high as that of Sandy Bridge processors. Since all of them are manufactured using the same production process and have similar core voltage, it becomes extremely interesting what exactly AMD meant by the energy-efficiency of their Bulldozer microarchitecture.
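To put the performance-per-watt argument into numbers, here is a minimal sketch. The helper name `relative_efficiency` is ours, and the performance ratio of 1.0 is a purely hypothetical best case (performance parity with a Sandy Bridge CPU), not a measured result. Only the power ratio of 2.0 reflects the "about twice as high" reading mentioned above.

```python
def relative_efficiency(perf_ratio, power_ratio):
    """Performance-per-watt of CPU A relative to CPU B.

    perf_ratio  = performance of A / performance of B
    power_ratio = power draw of A / power draw of B
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return perf_ratio / power_ratio


# power_ratio = 2.0: CPU power draw "about twice as high" (from the text).
# perf_ratio = 1.0: hypothetical parity, assumed for illustration only.
print(relative_efficiency(1.0, 2.0))
```

Even under this generous assumption, the result is half the performance-per-watt of the competitor. Any test where FX-8150 is slower pushes the ratio lower still.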
Socket AM3+ platform and FX-series processors are positioned as overclocking-friendly right from the start. This follows not only from the fact that all FX processors have unlocked multipliers, but also from a number of extreme overclocking experiments supported by AMD, in one of which they set an overclocking world record using a new FX-8150 processor. The company’s statement about the new microarchitecture being well-optimized for work at high frequencies also seems very promising. Could it be a new overclocking wonder? Let’s find out.
It is extremely easy to overclock any FX processor: their logo states “Unlocked” for a reason. You can change the processor clock frequency by changing its multiplier right in the mainboard BIOS Setup, or via special utilities from AMD (Overdrive Utility) as well as from mainboard vendors. You can also overclock the integrated North Bridge and system memory in a Socket AM3+ system the same way.
During our tests we managed to get our FX-8150 to work stably at 4.6 GHz. For increased stability we raised the processor core voltage to 1.475 V and enabled Load-Line Calibration option. During our stability tests the CPU temperature at this frequency didn’t exceed 85°C, according to the under-the-socket diode and 75°C, according to the integrated thermal diode in the CPU itself. As we have already said, we used a very efficient air-cooler – NZXT Havik 140.
Note that we also tried to simultaneously overclock the North Bridge integrated into the processor, because increasing its frequency will have a positive effect on the L3 cache memory and memory controller performance. However, unfortunately, we couldn’t get past 2.4 GHz frequency even though we tried to raise its voltage as well.
In any case, the result of our FX-8150 overclocking experiment – 4.6 GHz frequency – is a definite success, especially since AMD Phenom II processors rarely overclocked beyond 4.0 GHz with air-cooling. In other words, Bulldozer microarchitecture really managed to push the frequency maximums somewhat further away.
However, we should actually compare the results of our FX processor overclocking with those of Intel Core i5 and Core i7 processors for LGA1155 systems. And these guys overclock just as well. For example, Core i5-2500K will typically overclock to 4.7 GHz under an air-cooler and with the Vcore increased by 0.15 V. And in this comparison, FX-8150 doesn’t look so victorious anymore.
Our impression from Zambezi overclocking will be spoilt even more if we compare the performance of the overclocked FX-8150 and Core i5-2500K (the increase compared with the nominal mode is given in brackets):
Overall, overclocking doesn’t really change the situation. However, in those applications where FX-8150 was faster in nominal mode, the gap is no longer that dramatic. And in those tests where Core i5-2500 was ahead, it managed to strengthen its position even more. In fact, it is not surprising at all: the clock frequency of our FX-8150 processor increased by 28% during overclocking, while the frequency of Core i5-2500K got 42% higher. Moreover, as we can tell from the way performance grew during overclocking, Intel Sandy Bridge microarchitecture is more sensitive to frequency increase. In other words, even if we take into account overclocking, the new Bulldozer processors don’t look superior to Intel’s ones, even though they overclock pretty well.
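The percentages quoted above are easy to verify. The short sketch below assumes a nominal frequency of 3.6 GHz for FX-8150 (as stated earlier in this review) and 3.3 GHz for Core i5-2500K. The latter figure is not given in this section and is taken from Intel's published specification. The function name `percent_increase` is ours.

```python
def percent_increase(base_ghz, overclocked_ghz):
    """Relative frequency gain from overclocking, in percent."""
    return (overclocked_ghz - base_ghz) / base_ghz * 100.0


# FX-8150: 3.6 GHz nominal, 4.6 GHz overclocked (both from this review).
fx_gain = percent_increase(3.6, 4.6)
# Core i5-2500K: 3.3 GHz nominal (assumption, see above), 4.7 GHz overclocked.
i5_gain = percent_increase(3.3, 4.7)

print(f"FX-8150:       +{fx_gain:.0f}%")
print(f"Core i5-2500K: +{i5_gain:.0f}%")
```

Rounded to whole percent, this gives the 28% and 42% mentioned above.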
So, is it success or failure? I am sure most of you would love to see a clear and definite verdict here. However, things are not so simple this time and AMD Bulldozer made things really difficult for all reviewers.
The thing is that AMD revealed a totally unique approach to developing new microarchitecture. Keeping in mind that the processor performance consists of three major components, such as number of instructions per clock, core frequency and number of cores, AMD engineers shifted their priorities towards the number of cores this time. They lowered the specific core performance, but at the same time got the opportunity to create inexpensive eight-core or even more complex processors. This is a very important milestone for the server market where multi-threaded loads dominate and multi-core processors are in high demand. So, the new Bulldozer microarchitecture from AMD will most likely help the company to strengthen their positions in the high-performance server segment.
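The trade-off between the three components can be illustrated with a deliberately crude model. It ignores shared module resources, turbo modes, caches and memory. The function name and the per-core figures are our own illustrative assumptions: a relative per-clock performance of 0.5 for the "many simple cores" design echoes the rough "two Bulldozer cores per one Sandy Bridge core" estimate made earlier, and is not a measurement.

```python
def relative_throughput(ipc, freq_ghz, cores, threads):
    """Toy estimate: instructions per clock x frequency x busy cores.

    A core only contributes if there is a thread to run on it, so the
    number of busy cores is capped by the number of threads.
    """
    busy_cores = min(cores, threads)
    return ipc * freq_ghz * busy_cores


# Illustrative designs at the same 3.6 GHz clock (values are assumptions):
many_simple = dict(ipc=0.5, freq_ghz=3.6, cores=8)
few_complex = dict(ipc=1.0, freq_ghz=3.6, cores=4)

for threads in (2, 4, 8):
    a = relative_throughput(threads=threads, **many_simple)
    b = relative_throughput(threads=threads, **few_complex)
    print(f"{threads} threads: many simple cores = {a:.1f}, few complex cores = {b:.1f}")
```

In this model the eight simple cores only draw level once all eight are busy. With two or four threads, the four complex cores come out twice as fast, which is the server-versus-desktop mismatch in a nutshell.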
However, today we introduced to you an FX processor based on the new Bulldozer microarchitecture but designed for the desktop segment. And this is where we observed a dramatic mismatch between Bulldozer’s hardware functionality and the needs of typical desktop applications. It is particularly frustrating that the entire marketing effort was aimed at making us believe that Bulldozer would be the rising star of the desktop market. Unfortunately, this never happened.
FX processors based on Bulldozer microarchitecture managed to show their strengths only in a small variety of common user tasks. There are very few popular applications that generate a simple multi-threaded integer load, and this is the only case when Bulldozer really performs at its best. As a result, in certain applications the new Bulldozer is not just slower than competitors from Intel, but is even slower than the previous-generation Phenom II X6. And it means that AMD didn’t succeed in launching a revolutionary desktop CPU.
In fact, FX is just another Phenom, which looks pretty good especially compared with the predecessors. Overall, FX processors are faster than Phenom II, they overclock much better and consume slightly less power, so they will be a good replacement for the CPUs on old K10 microarchitecture.
However, I would like to remind you that AMD is competing not only against itself, but also against Intel. Therefore, we have to draw the unwelcome conclusion that FX processors will only be a good choice for those desktop systems that will primarily be used for video processing and transcoding. In all other cases Bulldozer processors, unfortunately, cannot compete against Sandy Bridge. The same is true for power consumption as well as overclocking. I would also like to add that AMD FX processors quite expectedly turned out to be a poor choice for gamers, because contemporary 3D games barely use true multi-threaded algorithms. However, I am sure that dedicated AMD fans will be able to put up with that, since the fps rate in games is in most cases limited by the graphics card rather than the processor.
In other words, the marketing success of the new FX processors will solely depend on two factors: how numerous the AMD fan club is and how smartly the company will use its pricing strategies. But either way the desktop Bulldozer-based processors will hardly ever become truly popular.