Heterogeneous Computing Performance
In promoting its hybrid processors, AMD keeps reminding us that the integrated graphics core can be used to accelerate general-purpose computations. That’s true. The OpenCL and DirectCompute frameworks that enable parallel computing on both x86 and graphics cores are supported by both the AMD Trinity and the Intel Ivy Bridge series. And while only a few specialized applications used to take advantage of them, the idea of heterogeneous computing has become much more widespread now. That’s why we want to run some performance tests in applications that can make full use of all the resources provided by hybrid processors. There are quite a lot of such applications available, so we can pick a few popular ones. Thus, our testing will have some practical worth, too.
We want to start out with the simple tasks of video decoding and transcoding. Modern hybrid processors do this by utilizing their graphics cores, but not their shader processors. There are specialized subunits for that purpose. Intel calls its unit Quick Sync, whereas in AMD APUs these subunits are referred to as UVD3 and VCE.
Today’s processors have no problems playing back HD video in various formats. Hardware video decoding works perfectly even when it comes to playing a 1080p stream at 60 fps and high bitrate. However, as higher resolutions and bitrates get more popular, inexpensive processors may find it difficult to cope. For example, our test uses a widescreen 4096x1744p clip at 24 fps encoded in H.264 with a bitrate of about 34 Mbps. Even when it is played via DXVA with hardware decoding enabled, frames get dropped, and the number of dropped frames depends directly on the CPU’s capabilities. The diagram below shows the average number of displayed frames when the test video is played in Media Player Classic – Home Cinema version 1.6.5. We enabled subtitles to make the test even more difficult.
We’ve got some unusual results playing our 4K video. The A10-5800K and A8-5600K APUs are the best with a minimum of dropped frames. The two Core i3 processors are somewhat worse, closely followed by the A10-5700 and A8-5500. The A6, A4, Pentium and Celeron processors are on the losing side in this test, dropping about half of all the frames in the test video.
Well, there is actually no processor that copes with decoding our 4K video perfectly. None of them can be used in a truly versatile media center. As UHD and 4K formats get more popular, users may find it difficult to play movies and video clips with the best possible quality. Software players may get optimized to improve this situation, yet it would be safer to rely on higher-performance hardware components instead.
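To put the test clip into perspective, we can compare its raw pixel throughput with that of the 1080p stream at 60 fps that hardware decoders handle without trouble. This is a rough sketch: the resolutions and frame rates come from the text above, but treating pixels per second as a proxy for decoding cost is our own simplifying assumption, and the helper name `pixel_rate` is made up for illustration.

```python
# Rough decoder workload comparison by raw pixel throughput.
# Resolutions and frame rates are taken from the text; using pixels per
# second as a proxy for decoding cost is a simplifying assumption.

def pixel_rate(width, height, fps):
    """Return the number of pixels the decoder has to output per second."""
    return width * height * fps

clip_4k = pixel_rate(4096, 1744, 24)      # the 4K test clip
full_hd_60 = pixel_rate(1920, 1080, 60)   # the 1080p stream at 60 fps

ratio = clip_4k / full_hd_60
print(f"4K clip: {clip_4k / 1e6:.1f} Mpixel/s")
print(f"1080p60: {full_hd_60 / 1e6:.1f} Mpixel/s")
print(f"ratio:   {ratio:.2f}x")
```

By this measure the 4K clip asks for roughly 1.4 times the pixel throughput of a 1080p stream at 60 fps, and that is before the bitrate and the subtitles are taken into account.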
The other popular video processing task is transcoding. Today, every graphics core developer has realized that specialized transcoders should be integrated into their solutions. We checked out the transcoding capabilities of the Trinity and Ivy Bridge processors using CyberLink MediaEspresso 6.7, which supports both Intel Quick Sync and AMD VCE. During this test, a 1.5GB 1080p H.264 video clip (a 20-minute episode of a TV series) was transcoded into a lower-resolution format for viewing on an iPad 2 (H.264, 1280x768 pixels, 6 Mbps).
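The numbers above also tell us the average bitrate of the source clip and the approximate size of the transcoded file. The sketch below assumes that 1 GB means 10^9 bytes, that the episode lasts exactly 20 minutes, and that the 6 Mbps target covers the whole output stream. The function names are ours.

```python
# Average bitrate of the source clip and estimated size of the result.
# Assumptions: 1 GB = 10**9 bytes, the clip lasts exactly 20 minutes,
# and container overhead is ignored.

def average_bitrate_mbps(size_bytes, duration_s):
    """Average bitrate in megabits per second."""
    return size_bytes * 8 / duration_s / 1e6

def size_mb(bitrate_mbps, duration_s):
    """Stream size in megabytes for a given bitrate and duration."""
    return bitrate_mbps * 1e6 * duration_s / 8 / 1e6

duration_s = 20 * 60
source_mbps = average_bitrate_mbps(1.5e9, duration_s)
target_mb = size_mb(6, duration_s)

print(f"source: about {source_mbps:.0f} Mbps")
print(f"target: about {target_mb:.0f} MB at 6 Mbps")
```

Under these assumptions the source averages about 10 Mbps, and the 6 Mbps output should come to about 900 MB.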
The results of the Celeron and Pentium processors are indicative of how important hardware acceleration is for that task. Intel disables Quick Sync in its junior CPU models and their transcoding speed is comparable to the length of the original video. The Core i3 series has Quick Sync and performs the job 10 times faster. We can also note that the senior version of the graphics core, the HD Graphics 4000, is faster by a third, so it differs from the HD Graphics 2500 in this respect as well, not only in the number of execution units.
Anyway, Quick Sync remains the fastest hardware transcoding solution in every implementation. The Trinity series with its VCE technology is only one third as fast as its Intel opponents in this test. VCE delivers the same transcoding performance in every APU, by the way. The only exception is the A4-5300 model, which is about 20% slower than its cousins.
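The rounded ratios quoted above can be turned into approximate transcoding times for the 20-minute clip. Please note that these are not measured results. The sketch only applies the ratios from the text (software transcoding at about real-time speed, Quick Sync about 10 times faster, HD Graphics 4000 faster by a third), and it assumes that "one third as fast" for VCE is relative to the HD Graphics 2500 figure, which the text does not state explicitly.

```python
# Approximate transcoding times derived from the rounded ratios in the text.
# These are illustrative estimates, not benchmark data.

CLIP_MINUTES = 20.0

# Speed relative to real time (1.0 = transcoding takes as long as the clip).
relative_speed = {
    "software only (Celeron, Pentium)": 1.0,
    "Quick Sync, HD Graphics 2500": 10.0,
    "Quick Sync, HD Graphics 4000": 10.0 * 4 / 3,  # faster by a third
    "VCE (assumed 1/3 of the HD 2500 speed)": 10.0 / 3,
}

def transcode_minutes(speed):
    """Time needed for the clip at the given speed relative to real time."""
    return CLIP_MINUTES / speed

estimates = {name: transcode_minutes(s) for name, s in relative_speed.items()}

for name, minutes in estimates.items():
    print(f"{name}: about {minutes:.1f} min")
```

With these assumptions, the gap between software and hardware transcoding is the difference between waiting for the whole length of an episode and waiting for a couple of minutes.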
Video transcoding and playback are undoubtedly among the most important tasks for home computers, but we are interested in how modern hybrid processors do in true heterogeneous applications that run on both x86 cores and shader processors. A significant indicator that the hybrid processor concept has been accepted by the software market is the fact that OpenCL support has been added to the popular data compression tool WinZIP. Its 17th version can use GPU resources to compress files, the x86 and graphics cores sharing the load in the following way:
According to the diagram, it is the x86 cores that do the bulk of the work, yet the GPU can help a lot. So it is no wonder that the advanced graphics core ensures a substantial performance boost for AMD’s Socket FM2 processors in WinZIP.
The diagram shows that AMD’s effort in promoting the heterogeneous computing concept has not been wasted. The Radeon HD graphics cores implemented in the Trinity APUs really help improve their performance. As a result, the A10 and A8 APUs are as fast as the Core i3 series, the overall picture being different from what we’ve seen in conventional applications that do not use graphics core resources. The junior dual-core Socket FM2 APUs do not do as well as their senior cousins, though. They are still much slower than even the Celeron G1620.
We should keep in mind that OpenCL can’t make AMD’s APUs superior to their opponents everywhere. GPU-based acceleration can only be achieved with algorithms that allow the original task to be decomposed into a lot of identical subtasks. That’s why the majority of heterogeneous software is concerned with image and video processing.
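Decomposing a task into identical subtasks can be illustrated with a small sketch. Here a list of pixel values is split into chunks, and the same operation is applied to each chunk independently. This is not OpenCL code. A Python thread pool merely stands in for the GPU's shader processors (and will not actually run faster than the plain loop), and the names `brighten`, `split` and `process_parallel` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixel, amount=40):
    """The identical subtask: brighten one pixel, clamping at 255."""
    return min(255, pixel + amount)

def process_chunk(chunk):
    """Apply the same operation to every pixel of one chunk."""
    return [brighten(p) for p in chunk]

def split(data, parts):
    """Split data into at most `parts` contiguous chunks."""
    if not data:
        return []
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_parallel(pixels, workers=4):
    """Process the chunks independently, then join them in order."""
    chunks = split(pixels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [p for chunk in results for p in chunk]

pixels = [(i * 7) % 256 for i in range(1000)]
sequential = process_chunk(pixels)
parallel = process_parallel(pixels)
print("same result:", parallel == sequential)
```

The important point is that no chunk needs the result of any other chunk, so the work can be spread across as many execution units as are available. Tasks where each step depends on the previous one cannot be split up this way.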
The image editor GIMP is a good example of such an application. Its latest version features a library of filters with support for OpenCL acceleration. As opposed to WinZIP, these filters are executed almost exclusively on the graphics core, whereas the x86 cores only do some auxiliary work.
So it is no wonder that GIMP runs better on high-performance graphics hardware. As an illustration, we can show you the speed of sequential execution of three resource-consuming effects: Gaussian blur, Motion blur and Bilateral.
GPU-based performance acceleration is something you can actually notice. Under favorable conditions, the graphics core’s shader processors can ensure a substantial performance boost. The graphics core architecture of AMD’s Trinity APUs is not only faster than Intel’s HD Graphics but also better optimized for computing. So when an application, like GIMP, is OpenCL-optimized, AMD’s APUs can deliver outstanding performance in comparison with their Intel counterparts. The Core i3-3225 with the most advanced version of Intel’s integrated graphics is only as fast as the junior Socket FM2 processor AMD A4-5300 when it comes to applying these image filters. The other Intel CPUs are much slower.
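To see why a filter such as Gaussian blur suits a graphics core so well, consider a one-dimensional version of it. The sketch below is our own illustration and not GIMP's implementation. The function names, the kernel radius and the edge handling (clamping to the nearest pixel) are all assumptions.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_row(row, kernel):
    """Blur one row of pixels; edges are clamped to the nearest pixel."""
    radius = len(kernel) // 2
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # clamp the index
            acc += w * row[j]
        out.append(acc)
    return out

row = [0, 0, 0, 255, 0, 0, 0]  # a single bright pixel
kernel = gaussian_kernel(radius=2, sigma=1.0)
blurred = blur_row(row, kernel)
print([round(v, 1) for v in blurred])
```

Every output pixel is a weighted sum of a few input pixels and does not depend on any other output pixel. That means all output pixels can be computed at the same time, which is exactly the kind of load that shader processors are built for.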
Another example of a popular OpenCL-compatible application is the professional video editing tool Sony Vegas Pro 12. When rendering video, it can distribute the load among all the computing resources of hybrid processors.
It must be noted that Intel’s graphics cores are not compatible with that application for some reason, although Ivy Bridge is specified to have no limitations in terms of OpenCL support. Anyway, owners of LGA1155 systems can only rely on the conventional x86 computing resources here. On the other hand, this doesn’t prevent the Intel CPUs from looking better in this video rendering test in Sony Vegas Pro than in the previous case.
AMD’s quad-core Trinity APUs are about as fast as Intel’s Core i3 in Sony Vegas Pro. The dual-core A6 and A4 series models compete successfully with the Pentium and Celeron CPUs.
Next we tested our processors in SVPMark 3. It is a specialized performance benchmark for the SmoothVideo Project software, which improves video playback by inserting new intermediate frames into the video stream. This software makes active use of GPU resources via OpenCL.
Well, the APU load graph shows that it is the x86 cores that do the bulk of the work here.
However, we still see AMD’s Socket FM2 A10 and A8 APUs outperform Intel’s Core i3. Judging by the difference between the Core i3-3225 and the Core i3-3220, the graphics core’s performance is important for this benchmark, so it is no wonder that the quad-core Trinity models are in the lead. The dual-core A6 and A4 APUs look good, too.
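The idea of inserting new frames between existing ones can be shown with a deliberately naive sketch that simply blends neighboring frames. This is only an illustration of the principle and not how the SmoothVideo Project software works internally. Frames are modeled as flat lists of pixel values, and the names `blend_frames` and `interpolate` are hypothetical.

```python
def blend_frames(frame_a, frame_b, t):
    """Linear blend of two frames; t=0 gives frame_a, t=1 gives frame_b."""
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]

def interpolate(frames, factor):
    """Insert factor-1 blended frames between each pair of neighbors."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(list(a))
        for step in range(1, factor):
            out.append(blend_frames(a, b, step / factor))
    if frames:
        out.append(list(frames[-1]))  # keep the final original frame
    return out

frames = [[0, 0], [100, 200], [50, 0]]  # three tiny two-pixel frames
doubled = interpolate(frames, 2)
print(len(frames), "frames in,", len(doubled), "frames out")
```

Even in this simplified form, every pixel of every new frame is computed independently of the others, which helps explain why the graphics core’s performance matters in this benchmark.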
The results suggest that heterogeneous load is what the Socket FM2 platform needs to show its best. Intel’s CPUs, except perhaps the Core i3-3225, are not strong under such conditions. So if you plan to use video or image editing applications with OpenCL support, you may want to consider the graphics core’s performance when choosing the optimal platform. This factor may affect your platform’s speed in such applications even more than in 3D games.
We should keep in mind that the integrated graphics core can only be employed for general-purpose computing if there is no discrete graphics card in the system. If one is installed, the integrated graphics is disabled in the processor, so the whole APU concept is only applicable to integrated platforms. When the system includes a discrete graphics card, the integrated GPU has no effect on 3D or heterogeneous computing performance, which means that the computing performance of the x86 cores remains the main factor for choosing CPUs for classic discrete PC configurations.