by Anton Shilov
02/07/2012 | 01:58 PM
Researchers from North Carolina State University have developed a new technique that improves the performance of AMD Fusion or Intel Sandy Bridge hybrid chips by more than 20% on average. The engineers propose taking advantage of features of x86 microprocessors, such as data pre-fetching and large caches, to speed up the execution of highly parallel tasks on graphics processing units.
“Chip manufacturers are now creating processors that have a ‘fused architecture,’ meaning that they include CPUs and GPUs on a single chip. This approach decreases manufacturing costs and makes computers more energy efficient. However, the CPU cores and GPU cores still work almost exclusively on separate functions. They rarely collaborate to execute any given program, so they aren’t as efficient as they could be. That’s the issue we’re trying to resolve,” said Dr. Huiyang Zhou, an associate professor of electrical and computer engineering who co-authored a paper on the research.
Central processing units (CPUs) have less computational power than graphics processing units (GPUs) – but are better able to perform more complex tasks and have a number of special-purpose units that are not present on graphics processors.
“Our approach is to allow the GPU cores to execute computational functions, and have CPU cores pre-fetch the data the GPUs will need from off-chip main memory. This is more efficient because it allows CPUs and GPUs to do what they are good at. GPUs are good at performing computations. CPUs are good at making decisions and flexible data retrieval,” said Dr. Zhou.
In other words, CPUs and GPUs fetch data from off-chip main memory at approximately the same speed, but GPUs can execute the functions that use that data more quickly. So, if a CPU determines what data a GPU will need in advance, and fetches it from the main memory, that allows the GPU to focus on executing the functions themselves – and the overall process takes less time.
In the proposed CPU-assisted GPGPU scheme, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel by the proposed compiler algorithms and contains the kernel's memory access instructions for multiple threadblocks. The CPU pre-execution program runs ahead of the GPU threads because (1) the CPU pre-execution thread contains only the memory fetch instructions from the GPU kernel and none of its floating-point computations, and (2) the CPU runs at a higher frequency and exploits a higher degree of instruction-level parallelism than the GPU's scalar cores. The researchers also leverage the prefetcher at the L2 cache on the CPU side to increase the memory traffic from the CPU. As a result, the memory accesses of the GPU threads hit in the L3 cache, and their latency can be drastically reduced. Since the pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. The researchers' experiments on a set of benchmarks show that the proposed pre-execution improves performance by up to 113%, and by 21.4% on average.
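The idea can be illustrated with a toy simulation. The Python sketch below is not the researchers' compiler or hardware model: the vector-add kernel, the array base addresses, the 64-byte line size, the cache capacity and every function name are assumptions made for illustration. It models the shared L3 as a small LRU cache, lets a "CPU" replay only the loads of each threadblock before the "GPU" runs that block, and counts how many GPU loads hit. The real scheme runs the CPU thread concurrently and ahead of the GPU; here the two simply alternate, and no timing is modeled.

```python
from collections import OrderedDict

LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 4    # assumed element size (32-bit float)


class SharedL3:
    """Toy LRU cache of line indices, standing in for the shared L3."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, line):
        """Touch a line; return True on a hit, False on a miss (then fill)."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return hit


def kernel_load_lines(block, threads_per_block):
    """Lines read by one threadblock of a hypothetical kernel c[i] = a[i] + b[i],
    assuming its loads are coalesced to one request per cache line."""
    base_a, base_b = 0, 1 << 20  # assumed array base addresses in bytes
    first = block * threads_per_block * ELEM_BYTES
    last = first + threads_per_block * ELEM_BYTES
    for offset in range(first, last, LINE_BYTES):
        yield (base_a + offset) // LINE_BYTES
        yield (base_b + offset) // LINE_BYTES


def cpu_pre_execute(cache, block, threads_per_block):
    """CPU side: replay only the block's loads, skipping the arithmetic."""
    for line in kernel_load_lines(block, threads_per_block):
        cache.access(line)


def gpu_run_block(cache, block, threads_per_block):
    """GPU side: issue the block's loads and count hits in the shared cache."""
    hits = total = 0
    for line in kernel_load_lines(block, threads_per_block):
        hits += cache.access(line)
        total += 1
    return hits, total


def hit_rate(num_blocks, threads_per_block, prefetch, capacity_lines=4096):
    """Fraction of GPU loads that hit, with or without CPU pre-execution."""
    cache = SharedL3(capacity_lines)
    hits = total = 0
    for block in range(num_blocks):
        if prefetch:
            cpu_pre_execute(cache, block, threads_per_block)
        h, n = gpu_run_block(cache, block, threads_per_block)
        hits += h
        total += n
    return hits / total
```

Calling `hit_rate(8, 256, prefetch=False)` and `hit_rate(8, 256, prefetch=True)` compares the two cases. In this toy model the kernel streams through its data without reuse, so GPU loads can only hit when the CPU has already touched the line; real workloads and cache behavior are considerably more involved.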
The paper, “CPU-Assisted GPGPU on Fused CPU-GPU Architectures”, will be presented in late February at the 18th International Symposium on High Performance Computer Architecture, in New Orleans. The paper was co-authored by NC State students Yi Yang and Ping Xiang, and by Mike Mantor of Advanced Micro Devices. The research was funded by the National Science Foundation and AMD.