AMD Talks Fusion: Technology, Software, Strategy

Fusion is the technology AMD has discussed most over the last five years. As the launch of the first Fusion APU approaches, the company is starting to reveal more details about the Fusion architecture and its technical aspects. Find out how AMD plans to tweak the performance of Llano, why it decided to use TSMC fabs for Ontario, why it took more than four years to fuse a CPU with a GPU, and more insights into the AMD Fusion program.

by Anton Shilov
12/07/2010 | 11:36 PM

The Fusion program by Advanced Micro Devices is probably one of the most significant hardware innovations since the introduction of multi-core x86-64 microprocessors with integrated memory controllers. Putting a graphics processor inside the central processing unit is not just a matter of fusing two different computer components: ultimately, such hybrid chips, called accelerated processing units (APUs), will open up a new computing paradigm and new types of processors.

 

For AMD, the Fusion initiative is not only a vehicle for riding into a heterogeneous multi-core future, but also proof that the company did the right thing when it acquired ATI Technologies back in 2006 and essentially sold off its manufacturing facilities in 2008.

But what do we actually know about AMD's Fusion program, apart from the fact that it has cost billions of dollars? What is AMD hoping for? Why did it take so long to release the first Fusion chip? What are AMD's expectations for the forthcoming Brazos platform for low-cost notebooks and netbooks powered by the Ontario and Zacate APUs? What benefits should consumers expect from the new hybrid processors, and what should they not expect? Today we talk to Godfrey Cheng of AMD to learn about the past and present of Fusion first-hand.

X-bit labs: Hello, can you introduce yourself to our readers, please?

Godfrey Cheng: My name is Godfrey Cheng and I am director of the client technology unit at AMD. My focus is primarily on technology planning for our products and on developing plans to use our Fusion APUs to accelerate current types of computing as well as new ones enabled by Fusion APUs.

In other words, Mr. Cheng is one of the key people responsible for the success of AMD's Fusion business. While the first Fusion products are developed for low-cost notebooks and multimedia netbooks, APUs will eventually be inside desktops and notebooks and will represent the largest chunk of AMD's business in terms of revenue. The success of all these platforms will be determined not only by the performance of the chips, but also by software able to take advantage of the parallel compute capabilities of the forthcoming APUs.

Fusion Hardware: When 1+1≠2

What Took So Long?

AMD originally promised to release hybrid chips with integrated x86 processing elements and graphics processing elements in late 2008 or early 2009. But after numerous failures and a couple of faulty prototypes, the company made major changes to its roadmaps, delayed the actual Fusion products to the 2010-2011 timeframe and, at the same time, boosted the specifications of the APUs with DirectX 11 graphics cores as well as new CPU micro-architectures.

X-bit labs: The road to fusing AMD and ATI technologies has been very long and rather hard. What were, and still are, the main challenges in integrating the CPU and GPU?

Godfrey Cheng: If you look at the very high level, people automatically assume that this is a simple integration of a CPU and a GPU [into a single piece of silicon]. This is not the case. What we are doing is adding a parallel computing unit and additional fixed-function units into the mix for the APU. [...] We are also adding a new programming paradigm with OpenCL and DirectCompute.

There are natural hardware challenges. We do have to innovate on two physically different devices, glue them together and make sure they work with new protocols like OpenCL.

X-bit labs: So it was not just a hardware issue, but a set of problems that you had to solve before proceeding with an actual product?

Godfrey Cheng: Yes. You can look at it this way: we view Fusion not only as products; we need to plan our architecture for the current generation and for two or three generations down the road. This does take time, and it will see our products take big steps in innovation along the way. We need to make sure they are forward and backward compatible, and that takes some time.

X-bit labs: So for Ontario and Zacate you needed not only the Bobcat micro-architecture, but also OpenCL- and DirectCompute-compliant graphics cores and other things in place before fusing them into one chip?

Godfrey Cheng: CPU architecture, OpenCL/DirectCompute-compliant GPUs, as well as fixed-function hardware. In the Fusion parts that we are going to launch early next year, for example, we have updated the UVD block.

In mid-2010 the company decided to swap the release timeframes of the lower-power Ontario/Zacate and the high-performance Llano APUs. Apparently, yields of Llano, which is to be made by Globalfoundries using 32nm silicon-on-insulator process technology, were lower than expected, while yields of the lower-performance Ontario/Zacate, produced on the 40nm node of Taiwan Semiconductor Manufacturing Company (TSMC), were fine. As a result, the Fusion era starts with low-cost, low-power chips.

X-bit labs: Why did you decide to use TSMC process technology rather than Globalfoundries' fabrication processes?

Godfrey Cheng: At the very high level, AMD is now a separate company from Globalfoundries. Both Globalfoundries and TSMC are important partners for us. As a company, we are interested in having more than one source for our products. At a technical level, we actually have a lot of IP [libraries] with TSMC in terms of our GPUs, display and analogue technology. So it is an easier task to integrate at TSMC because of those display and analogue libraries.

X-bit labs: What version of TSMC's 40nm process technology did you use? As far as I understand, you used TSMC's 40LP flavour for Ontario and Zacate, am I correct?

Godfrey Cheng: What we did was that we leveraged as many of the technologies as we could from the GPU side.

Brazos Platform: Over 90GFLOPS of Compute Power

X-bit labs: How high is the projected performance of Ontario/Zacate in terms of GFLOPS? Was achieving a certain level of performance actually a major priority?

Godfrey Cheng: With Zacate we are looking at over 90GFLOPS, which is pretty impressive for a chip of that thermal design power and size... Specific performance goals were not a major priority.
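Peak GFLOPS figures like this are theoretical numbers: execution units multiplied by clock speed multiplied by floating-point operations per unit per cycle, summed over the CPU and GPU parts of the chip. The sketch below shows that arithmetic. The shader count, clock speeds and per-cycle FLOP rates are illustrative assumptions, not specifications AMD confirmed in this interview; only the 18W TDP comes from Mr. Cheng's later remarks about Zacate.

```python
from dataclasses import dataclass


@dataclass
class ComputeBlock:
    """A group of identical execution units (all values hypothetical)."""
    name: str
    units: int              # number of GPU ALUs or CPU cores
    clock_ghz: float        # clock speed in GHz
    flops_per_cycle: int    # single-precision FLOPs per unit per cycle

    def peak_gflops(self) -> float:
        # units * GHz * FLOPs per cycle = GFLOPS
        return self.units * self.clock_ghz * self.flops_per_cycle


def apu_peak_gflops(blocks) -> float:
    """Theoretical peak of the whole APU: the sum over CPU and GPU blocks."""
    return sum(b.peak_gflops() for b in blocks)


# Assumed configuration: 80 GPU ALUs doing one multiply-add (2 FLOPs)
# per cycle at 500 MHz, plus two x86 cores at 1.6 GHz doing
# 4 single-precision FLOPs per cycle each.
zacate_like = [
    ComputeBlock("GPU ALUs", units=80, clock_ghz=0.5, flops_per_cycle=2),
    ComputeBlock("x86 cores", units=2, clock_ghz=1.6, flops_per_cycle=4),
]

total = apu_peak_gflops(zacate_like)
tdp_watts = 18.0  # Zacate TDP as quoted in the interview
print(f"peak: {total:.1f} GFLOPS, {total / tdp_watts:.2f} GFLOPS/W")
# prints "peak: 92.8 GFLOPS, 5.16 GFLOPS/W"
```

Under these assumptions the GPU contributes the vast majority of the peak figure, which is why the APU's compute story depends on software using the GPU ALUs.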

X-bit labs: How many memory channels do Ontario and Zacate have?

Godfrey Cheng: Ontario and Zacate have a single-channel DDR3 memory controller.

X-bit labs: Is there going to be a performance limitation because of memory bandwidth?

Godfrey Cheng: For Ontario/Zacate we are adding PC3-10600 (DDR3-1333) support. It is a fairly robust memory controller, and for the vast majority of applications we see no problems with going single-channel.
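The cost of a single channel is easy to quantify: a DDR3 channel is 64 bits (8 bytes) wide, so theoretical peak bandwidth is the transfer rate multiplied by eight bytes. DDR3-1333 performs roughly 1333 million transfers per second, which is where the PC3-10600 label comes from. A short sketch of that calculation, comparing one channel with a hypothetical dual-channel setup:

```python
BUS_WIDTH_BYTES = 8  # one DDR3 channel is 64 bits wide


def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int = 1) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_s = mega_transfers_per_s * 1e6 * BUS_WIDTH_BYTES * channels
    return bytes_per_s / 1e9


single = peak_bandwidth_gbs(1333)            # Ontario/Zacate: one channel
dual = peak_bandwidth_gbs(1333, channels=2)  # hypothetical dual-channel design
print(f"single-channel DDR3-1333: {single:.2f} GB/s")  # 10.66 GB/s
print(f"dual-channel DDR3-1333:   {dual:.2f} GB/s")    # 21.33 GB/s
```

On an APU this bandwidth is shared by the x86 cores and the integrated GPU, which is why the number of channels matters more than it would for a CPU paired with a discrete graphics card that has its own memory.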

Indeed, with the Brazos platform AMD decided not to squeeze all possible performance out of its chips, but focused instead on making them as small, as power-efficient and as feature-rich as possible. The result is obvious: the 9W dual-core Ontario chip delivers DirectX 11 graphics; parallel compute functionality; support for hardware decoding of various video formats, including Blu-ray; as well as a relatively modern processor micro-architecture that is at least a decade younger than the heart of Intel Atom.

X-bit labs: Is the memory controller of Fusion-series chips similar to those used today, or did you tweak it somehow?

Godfrey Cheng: If you look at Llano, its memory controller is quite different from that of Ontario/Zacate. We have also made some tweaks for the GPU in Llano. We try to design the best memory controller we can for each of our products. That is all we can share at this time.

Dynamic Speed Boost in a Netbook?

X-bit labs: Will Zacate support some kind of dynamic overclocking capability to speed up graphics or one of the x86 cores when needed?

Godfrey Cheng: We are going to look at how we are going to clock different components within the GPU properly. We are definitely going to have different parts of the APU running at different speeds...

X-bit labs: So, will you have a dynamic overclocking technology like TurboCore on AMD desktop chips or TurboBoost on Intel processors?

Godfrey Cheng: We can certainly throttle clock-speeds down for power savings. [...] We are basically still finalizing [specs of] production [units], doing very minor tweaking, [but, no exact details are available at this time].

X-bit labs: So we should probably not expect any dynamic performance-enhancing technologies there?

Godfrey Cheng: The architecture is certainly capable, but we are evaluating the benefits versus the detriments. What may work for our competitor's products may not work for our products. Our architectures within the chip are vastly different, so comparing products at this level may not make a lot of sense.

What you are going to see as a strategy is that we will try to spread the TDP dynamically as much as possible within a certain performance envelope. But I am not sure yet what the specific implementation will be.
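Throttling saves power because CMOS dynamic power scales roughly as P = a·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). Lowering the clock alone cuts power linearly; lowering the supply voltage that a lower clock permits cuts it quadratically on top of that. The sketch below uses two made-up voltage/frequency points, which are assumptions for illustration and not AMD's actual power states, to show the size of the effect.

```python
def dynamic_power(capacitance: float, voltage: float, freq_ghz: float,
                  activity: float = 1.0) -> float:
    """Classic CMOS dynamic power model P = a * C * V^2 * f (arbitrary units)."""
    return activity * capacitance * voltage ** 2 * freq_ghz


# Hypothetical operating points as (voltage in volts, frequency in GHz);
# illustrative values only, not real AMD P-states.
full_speed = (1.25, 1.6)
throttled = (0.95, 0.8)

p_full = dynamic_power(1.0, *full_speed)
p_slow = dynamic_power(1.0, *throttled)
print(f"half clock, lower voltage: {p_slow / p_full:.0%} of full dynamic power")
# prints "half clock, lower voltage: 29% of full dynamic power"
```

Halving the clock here cuts dynamic power to well under half. Static (leakage) power is not captured by this formula, which is one reason power gating, i.e. switching blocks off entirely, is attractive as well.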

We doubt that AMD actually needs to implement performance-boosting technologies in its low-cost, low-power APUs. Netbook users do not demand a lot of speed and would rather appreciate long battery life. Those who demand performance will be able to take advantage of the TurboCore capabilities offered by Llano, which is due in mid-2011.

Inside Ontario and Zacate

X-bit labs: I wonder whether you can describe how the two x86 cores and the stream processing cores are connected to each other, what the total bandwidth inside the chip is, whether the GPU essentially sits on an internal PCI Express bus, and so on?

Godfrey Cheng: There is an internal bus that we call common core interface, which is quite similar to PCI Express, but it is not just an internal PCI Express. It is a very efficient interface between the two cores.

X-bit labs: So, can we expect you to continue using that common core interface in future products?

Godfrey Cheng: Because we are not making a public specification, we have the ability to optimize the interconnects as compute loads or as programs change. What we chose for Llano and Ontario/Zacate might be very much different for the next generation. We are not going to physically recreate PCI Express as a proprietary spec internally, but we are going to optimize the interface as much as we can.

X-bit labs: All Evergreen graphics processors have a fairly complex internal cache architecture. Was this architecture inherited from the Cedar chip, or does the GPU use the L2 cache of the CPU cores?

Godfrey Cheng: When we integrated our discrete-level GPUs into APUs, we took the best of what we had to offer from our graphics processing units, so the fundamental architecture and cache architecture are the same as in the standalone family. It really does not make sense for us to start reinventing the cache for our built-in GPUs because, quite frankly, our GPUs are world-class and that is not something we need to innovate on.

Rivals: Sandy Bridge, Optimus...

X-bit labs: Intel Corp.'s Sandy Bridge processor manages to offer dramatically higher graphics performance thanks to improvements in the graphics cores as well as a unified L3 cache shared by the x86 and graphics cores. However, you skipped a unified L3 in both Llano and Ontario/Zacate. What are the reasons for that? Perhaps the common core interface is more efficient than what Intel uses?

Godfrey Cheng: I do not know a lot about the Sandy Bridge architecture, but that type of cache/memory implementation will not be efficient for our architecture. We basically took a discrete GPU and married it to a very good CPU, and that type of architecture demands that we make considerations and decisions within the memory controller and in how we use external memory. Using L3 might work well for Intel, but it will not work well for us, which is why we are diverging in this area.

X-bit labs: Did you consider a technology similar to Nvidia Optimus that would dynamically switch from integrated graphics to discrete within a Fusion platform?

Godfrey Cheng: Optimus is an interesting technology. We have PowerXpress, which is a similar technology [ATI PowerXpress allows notebook users to manually or automatically switch between an ATI Mobility Radeon HD discrete graphics processor and an integrated AMD M780G with ATI Radeon HD 3200 graphics without rebooting their notebook]. What we have not been discussing broadly so far is that you get a lot of synergy between a Llano APU and a discrete GPU. What we were able to bring together works better than one plus one.

X-bit labs: But the GPU integrated into Llano APU consumes a lot less power than a standalone graphics processor...

Godfrey Cheng: Fair enough; we have invested heavily to make sure that we do the right thing with power gating and so forth. The point is that when you pair an AMD APU with an AMD GPU, you get better performance than you would with the same GPU and an Intel processor.

A Glance into the Future

X-bit labs: What is more progressive and economically feasible for entry-level and mainstream processors in order to better serve the future needs of computing: implementing new instructions, or adding special-purpose accelerators, ALUs like those inside GPUs, or a wide vector processing unit (those are similar, though)?

Godfrey Cheng: The big difference between us and Intel is that we are not bound by the x86 ISA. So we are going to apply the right silicon, the right programming model and the right APIs in whatever way we see as best for the problem. Video decoding, for example, is handled really well by low-power fixed-function hardware like UVD, so we have a lot of fixed-function units in our APUs. We have brand-new Bulldozer and Bobcat CPU cores, but at the same time we have world-class ALUs from our GPUs, so we do not really care what silicon we put there as long as we focus on the right solution. We do not care how we innovate as long as we deliver the best solution possible.

X-bit labs: Does it make sense to assume that future APUs will have multiple parts working at different clock speeds? For example, an antivirus monitor and a browser do not require high clock speeds and can work at 500MHz, while another core works at 1.50GHz or higher?

Godfrey Cheng: Absolutely. For instance, in low-power settings you only need the GPU partially active to refresh the display, and you need one CPU core to keep the Windows operating system running. It is fundamental to our strategy to downclock as much as we can to preserve power; the other strategy is to apply power to the core that needs it most and increase its clock speed for whatever we need to do. Looking forward, and you might not see an implementation of this in the current generation of APUs, we are going to clock different parts of the CPU, or the APU, at different speeds or even shut down entire parts.
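The strategy described here, gating or downclocking whatever is idle and handing the freed power to the unit that needs it most, can be sketched as a simple power-budget allocator. Everything in this sketch (the block names, the wattages and the greedy policy itself) is a hypothetical illustration under stated assumptions, not AMD's power-management algorithm.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One independently clocked part of a hypothetical APU."""
    name: str
    min_watts: float   # power needed to stay active at the lowest clock
    max_watts: float   # power drawn at the highest supported clock
    demand: float      # 0.0 = idle (can be gated off), 1.0 = fully busy


def allocate(blocks, tdp_watts):
    """Greedy sketch: gate idle blocks, give every active block its minimum,
    then hand the remaining headroom to the busiest blocks first."""
    budget = {b.name: (0.0 if b.demand == 0.0 else b.min_watts) for b in blocks}
    headroom = tdp_watts - sum(budget.values())
    if headroom < 0:
        raise ValueError("minimum power of active blocks exceeds the TDP")
    for b in sorted(blocks, key=lambda blk: blk.demand, reverse=True):
        if b.demand == 0.0:
            continue  # gated blocks get nothing
        # A block may use extra power in proportion to how busy it is.
        wanted = (b.max_watts - b.min_watts) * b.demand
        extra = min(wanted, headroom)
        budget[b.name] += extra
        headroom -= extra
    return budget


# Hypothetical light-load scenario within a 9W envelope: one core busy,
# the second core idle, the GPU only refreshing the display.
blocks = [
    Block("cpu0", min_watts=1.0, max_watts=6.0, demand=1.0),
    Block("cpu1", min_watts=1.0, max_watts=6.0, demand=0.0),
    Block("gpu", min_watts=0.5, max_watts=8.0, demand=0.1),
]
print(allocate(blocks, tdp_watts=9.0))
```

In this scenario the idle core is gated, the busy core receives its full range and the lightly loaded GPU gets just enough to refresh the display. A real controller would work from measured activity and temperature rather than a fixed demand value.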

Software: GPGPU Is the Key

X-bit labs: The performance of Zacate (and Ontario) in x86 applications appears to be pretty low: it barely beats Atom in many cases and is slower than the Pentium dual-core. But it should obviously be much faster in GPGPU-based applications. Do you have plans to ensure that system vendors actually install such GPGPU applications on their netbooks or nettops with Fusion chips?

Godfrey Cheng: Benchmarks like Bapco Sysmark do not represent what consumers actually do. When we asked consumers what they wanted their computers to do, we learnt that they want better Internet video, better video in general and better Web surfing. These are the things that our [Ontario and Zacate] APUs are designed for; they are not designed to be benchmark winners. Instead, we are delivering great performance, great functionality and a great experience at very good TDPs. We are not going to chase the benchmark wars with our friends at Intel; that is not our goal. But if you look at products based on these APUs, people will get a great experience out of them.

We are definitely working with third-party software application companies to take advantage of parallel compute and fixed-function capabilities of our APUs. You will see companies shipping software next year that will be optimized for our accelerated processing units.

X-bit labs: When do you expect a wide range of consumer applications to start using GPUs to speed up stream processing?

Godfrey Cheng: We have been working with software developers for a while; we expect to see such applications in 2011. [...] We are working with multiple companies, and as part of our Fusion Fund initiative there are multiple applications in the pipeline that benefit from general-purpose computing on GPUs and thus from our ALUs.

X-bit labs: At present, software determines which cores (x86 or GPU) to use. But would you expect future chips to perform load balancing automatically and decide which hardware to use for particular operations?

Godfrey Cheng: That part is really done at the compiler, API and application levels. We are definitely working on the OpenCL application programming interface, and at some point an OpenCL variation will allow you to choose whether you put [your workload on] your GPU or your CPU. Basically, we are agnostic as to where the code runs as long as it runs efficiently; but software companies will control where exactly it runs. It is not decided at the chip level.

When we work with software developers, we say, "we have an APU that does this, this and this, why don't you try your code-path in different ways?" Ultimately, they will decide what is best for their software architecture.

What has never happened before is that, by offering them x86 cores, ALUs and fixed-function units, we let them approach and pattern their problems differently. If we can offer them 90GFLOPS+ of compute power in an eighteen-watt package, they might approach their problems that way and use capabilities they never had access to. That is what is different [with APUs].
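Since the choice between CPU and GPU is left to the compiler, the API and the application rather than the chip, an application needs some way to make it. One simple approach is a cost model that estimates run time on each device, including a fixed launch overhead and the time to move input data to the device, and picks the faster one. The sketch below is hypothetical: the device figures are placeholder assumptions rather than measurements of any real APU, and real code would dispatch the work through an API such as OpenCL instead of just returning a label.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    gflops: float             # assumed sustained throughput on this kernel
    launch_overhead_s: float  # fixed cost of starting work on the device
    transfer_gb_per_s: float  # input copy rate; float("inf") means no copy needed


def estimated_time(dev: Device, flop_count: float, bytes_in: float) -> float:
    """Predicted run time: launch overhead + data transfer + computation."""
    compute = flop_count / (dev.gflops * 1e9)
    transfer = bytes_in / (dev.transfer_gb_per_s * 1e9)
    return dev.launch_overhead_s + transfer + compute


def choose_device(devices, flop_count: float, bytes_in: float) -> Device:
    """Pick whichever device the cost model predicts will finish first."""
    return min(devices, key=lambda d: estimated_time(d, flop_count, bytes_in))


# Placeholder figures for illustration only.
cpu = Device("x86 cores", gflops=10.0, launch_overhead_s=0.0,
             transfer_gb_per_s=float("inf"))
gpu = Device("GPU ALUs", gflops=70.0, launch_overhead_s=50e-6,
             transfer_gb_per_s=5.0)

small = choose_device([cpu, gpu], flop_count=1e5, bytes_in=4e4)
large = choose_device([cpu, gpu], flop_count=1e10, bytes_in=4e7)
print(small.name, "|", large.name)  # prints "x86 cores | GPU ALUs"
```

The design choice illustrated here is that small jobs stay on the x86 cores because fixed overheads dominate, while large data-parallel jobs amortize those overheads and go to the GPU ALUs.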

Strategy: The Future Is Fusion

The Place for Fusion

X-bit labs: How will you position Ontario and Zacate against Mobile Phenom II or Turion-series processors?

Godfrey Cheng: Fusion APUs are the future of AMD. We will phase them, Ontario, Zacate and Llano, into the product mix over 2011 and transition out the products they replace.

X-bit labs: Will you continue to use the Fusion brand for Ontario and Zacate when they enter the market, or do you plan a separate brand name?

Godfrey Cheng: Ontario and Zacate will come to market under the "AMD Vision Technology" brand.

After the massive success of Intel Atom, industry observers started to talk about netbooks cannibalizing notebooks. When the Apple iPad tablet hit the market, many began to talk about slates cannibalizing netbooks. Interestingly, the first breed of AMD Fusion accelerated processing units will be able to compete against entry-level desktop and mobile chips, whereas the second breed, Llano, may even be able to fight against the most affordable discrete graphics cards.

X-bit labs: Do you think that APUs represent a threat to inexpensive discrete GPUs?

Godfrey Cheng: Absolutely not. We have always offered excellent graphics solutions. [By contrast], no matter where you go, the slowest graphics is always "blue". Even with what we have heard of Sandy Bridge, that does not change much. If you look at where Llano is going to be versus Sandy Bridge, we believe that we are going to be quite a bit faster still. And even though we are quite a bit faster, consumers want more! As a result, we are going to deliver faster and better discrete GPUs all the time.

Even though the APU raises the lowest common denominator, higher-performance discrete GPUs will offer enough extra performance that we think the market will still consume them. So, is the APU a threat? No, not for us.

X-bit labs: By releasing the Ontario and Zacate chips, you are blurring the borders between netbooks and notebooks. Aren't you afraid that these APUs will kill sales of notebook platforms? Will you impose any requirements on Ontario/Zacate-based computers (e.g., a small display)?

Godfrey Cheng: Fusion is raising the processing and experience bar at all levels. Ontario and Zacate will deliver great experiences and raise the bar at price points occupied today by function-limited netbooks and low-end notebooks. Llano will deliver even more performance and better experiences than Zacate- and Ontario-based products. AMD will not create any artificial barriers for our customers and encourages them to innovate with our Fusion APUs.

Functionality - Trigger of Revolution

X-bit labs: Can you name three key benefits that Brazos platform has against currently available inexpensive platforms for netbooks and low-cost notebooks?

Godfrey Cheng: Brazos has excellently balanced performance in x86, graphics and parallel compute. If you look at a typical Atom platform today, it basically has an out-of-date in-order x86 core [as well as a low-end graphics adapter].

We have an out-of-order x86 core. We have a DirectX 11 graphics engine with the UVD 3.0 video decoder. We have GPGPU with DirectCompute and OpenCL support. Our 9W/18W solutions provide performance and a feature set comparable to our competitor's 35W solutions...

Those are the obvious benefits of Brazos over existing Atom platforms.

X-bit labs: Will Ontario and Zacate enable lower-cost systems? At present, the gap between netbooks and notebooks is shrinking as netbooks get more expensive. Perhaps Ontario- or Zacate-based machines will cost less than those?

Godfrey Cheng: Our goal is not to lower the cost of systems because that would be counter-productive to our corporate goals. That is really not that interesting to the consumer either. Our goal is to provide more functionality at certain price-point(s). The new AMD mobile platforms will deliver more than we delivered last time.

X-bit labs: What are the starting prices of mobile computers based on Zacate and Ontario chips?

Godfrey Cheng: We believe we should see notebooks at around $349.

X-bit labs: Such low-cost systems are not designed for demanding applications, and nobody actually ran those applications on netbooks. Do you expect Fusion to change how PCs are used?

Godfrey Cheng: The level of expectations set by Atom netbooks is so low that Zacate-based machines with all their benefits will redefine what a $349 notebook should be. We absolutely intend to revolutionize the value at those price-points.

Consumer Electronics?

X-bit labs: At present, Intel and other chip companies are trying to enter the consumer electronics market with the Google TV platform. Are you interested in this particular platform, or in similar devices, for your Ontario and Zacate APUs?

Godfrey Cheng: We are absolutely interested in expanding our market. The question for us is when we make an investment, we have to be absolutely sure that our bets will pay off. So, we are looking at tablets, we are looking at Internet TVs and how we can position products for that. Ultimately, we need to have the right product to address the tablet and the Internet TV market.

X-bit labs: Actually, the Ontario APU with its 9W TDP seems to be an excellent product for set-top boxes and Blu-ray players with rich multimedia functionality, Blu-ray 3D support...

Godfrey Cheng: We would absolutely love to get into that market, but as you know there is a big software element to it. So we need not only the right hardware product, but also the right solution for our [potential] customers. Keep in mind, we are a small company: where we invest, we need to win.

X-bit labs: Thank you for your answers and good luck!