All modern microprocessors have their own high-speed memory banks, known as “caches,” which store frequently used data. Traditionally, managing the caches has required fairly simple algorithms that can be hard-wired into the chips. But to increase the performance of modern multi-core chips, more advanced algorithms may be required. In many cases, such algorithms are better implemented in software than in hardware.

Daniel Sanchez, an assistant professor in MIT’s department of electrical engineering and computer science, believes that it is time to turn cache management over to software. Recently, Mr. Sanchez and his student Nathan Beckmann presented a new system, dubbed Jigsaw, that monitors the computations being performed by a multi-core chip and manages cache memory accordingly. In experiments simulating the execution of hundreds of applications on 16- and 64-core chips, Sanchez and Beckmann found that Jigsaw could speed up execution by an average of 18% – with more than twofold improvements in some cases – while actually reducing energy consumption by as much as 72%. The researchers believe that the performance improvements offered by Jigsaw should only increase as the number of cores does.

Location, Location, Location

In most multi-core chips, each core has several small, private caches. But there is also what is known as a last-level cache (LLC), which is shared by all the cores. The LLC takes up 40% to 60% of the chip's area, but that is justifiable because it is crucial to performance. Without that cache, some applications would be an order of magnitude slower.

Physically, the last-level cache is broken into separate memory banks and distributed across the chip; for any given core, accessing the nearest bank takes less time and consumes less energy than accessing those farther away. But because the last-level cache is shared by all the cores, most chips assign data to the banks randomly.
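To make the cost of random placement concrete, here is a minimal Python sketch. It assumes a hypothetical 4x4 mesh of tiles, each holding one core and one cache bank, and uses the Manhattan hop count between tiles as a rough stand-in for access latency and energy. The mesh layout and the function names are illustrative assumptions, not details taken from the researchers' work.

```python
def hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh `width` tiles wide.

    Tiles are numbered row by row, so tile t sits at (t % width, t // width).
    """
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def average_hops_uniform(core, width, height):
    """Average hops per access from `core` when data is spread evenly
    over every bank, as random assignment does on average."""
    n_banks = width * height
    return sum(hops(core, bank, width) for bank in range(n_banks)) / n_banks


if __name__ == "__main__":
    # Corner core (tile 0) on the assumed 4x4 mesh.
    print(average_hops_uniform(0, 4, 4))
```

Under these assumptions, a corner core averages 3 hops per access when its data is spread over all 16 banks, whereas data held in the core's own bank would need none.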

Jigsaw, by contrast, monitors which cores are accessing which data most frequently and, on the fly, calculates the most efficient assignment of data to cache banks. For instance, data being used exclusively by a single core is stored near that core, whereas data that all the cores are accessing with equal frequency is stored near the center of the chip, minimizing the average distance it has to travel. Jigsaw also varies the amount of cache space allocated to each type of data, depending on how it’s accessed. Data that is reused frequently receives more space than data that is accessed infrequently or only once.
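The placement idea can be illustrated with a small Python sketch. It is not Jigsaw's actual algorithm: it assumes a hypothetical 4x4 mesh with one core and one bank per tile, treats Manhattan hop count as the cost of an access, and simply picks the single bank that minimizes the access-weighted distance for one piece of data. The names `hops` and `best_bank` are made up for this example.

```python
def hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh `width` tiles wide."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def best_bank(access_counts, width, height):
    """Return the bank with the lowest total access-weighted hop distance.

    access_counts maps a core's tile index to how often that core
    touched the data during the last monitoring interval.
    """
    def cost(bank):
        return sum(n * hops(core, bank, width)
                   for core, n in access_counts.items())

    return min(range(width * height), key=cost)


if __name__ == "__main__":
    # Data used only by the core on tile 10 is placed in tile 10's bank.
    print(best_bank({10: 1000}, 4, 4))

    # Data used equally by all 16 cores lands on a central tile.
    print(best_bank({core: 100 for core in range(16)}, 4, 4))
```

In this toy model, data touched by a single core ends up in that core's own bank, while data touched equally by every core ends up on one of the four central tiles (5, 6, 9 or 10), which matches the behavior described above.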

In principle, optimizing cache space allocations requires evaluating how the chip as a whole will perform given every possible allocation of cache space to all the computations being performed on all the cores. That calculation would be prohibitively time-consuming, but by ignoring some particularly convoluted scenarios that are extremely unlikely to arise in practice, Mr. Sanchez and Mr. Beckmann were able to develop an approximate optimization algorithm that runs efficiently even as the number of cores and the number of different types of data increase dramatically.
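The article does not spell out the researchers' approximation, so the Python sketch below swaps in a well-known stand-in: greedy allocation by marginal gain. It assumes each type of data comes with a hypothetical "miss curve" giving the expected number of cache misses for each amount of space, and it hands out space one unit at a time to whichever type saves the most misses. The curve values and the name `greedy_allocate` are invented for illustration.

```python
def greedy_allocate(miss_curves, total_units):
    """Split `total_units` of cache among data types, one unit at a time.

    miss_curves[name][k] is the expected number of misses when `name`
    receives k units of cache space (hypothetical, measured elsewhere).
    """
    alloc = {name: 0 for name in miss_curves}

    def gain(name):
        """Misses saved by giving `name` one more unit of space."""
        curve, k = miss_curves[name], alloc[name]
        if k + 1 >= len(curve):
            return float("-inf")  # no data beyond k units for this type
        return curve[k] - curve[k + 1]

    for _ in range(total_units):
        best = max(alloc, key=gain)
        if gain(best) == float("-inf"):
            break  # every curve is exhausted
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    curves = {
        "reused": [100, 60, 30, 15, 10, 8],     # benefits a lot from space
        "streaming": [100, 94, 93, 93, 93, 93],  # barely benefits
    }
    print(greedy_allocate(curves, 5))
```

With these made-up curves, the frequently reused data receives four of the five units and the streaming data receives one. A greedy pass like this is only guaranteed to find the best split when every miss curve shows diminishing returns, which is one reason real systems need more careful approximations.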

Quick Study

“Since the optimization is based on Jigsaw’s observations of the chip’s activity, it is the optimal thing to do assuming that the programs will behave in the next 20 milliseconds the way they did in the last 20 milliseconds. But there is very strong experimental evidence that programs typically have stable phases of hundreds of milliseconds, or even seconds,” said Mr. Sanchez.

The researcher also points out that the new work represents simply his group’s “first cut” at turning cache management over to software. Going forward, they will be investigating, among other things, the co-design of hardware and software to improve efficiency even further and the possibility of allowing programmers themselves to classify data according to its memory-access patterns, so that Jigsaw does not have to rely entirely on observation to evaluate memory allocation.

“More and more of our computation is happening in data centers. In the data-center space, it is going to be very important to be able to have the microarchitecture partition and allocate resources on an application-by-application basis. When you have multiple applications that are running inside a single box, there is a point of interference where jobs can hurt the performance of each other. With current commodity hardware, there are a limited number of mechanisms we have to manage how jobs hurt each other,” said Jason Mars, an assistant professor of computer science at the University of Michigan.

Mr. Mars cautions that a system like Jigsaw dispenses with a layer of abstraction between chip hardware and the software running on it.

“Companies like Intel, once they expose the micro-architectural configurations through the software layer, they have to keep that interface over future generations of the processor. So if Intel wanted to do something audacious with the microarchitecture to make a big change, they will have to keep that legacy support around, which can limit the design options they can explore. However, the techniques in Jigsaw seem very practical, and I could see some variant of this hardware-software interface being adopted in future designs. It’s a pretty compelling approach, actually,” explained Mr. Mars.





They created the Jigsaw algorithm and showed that it's very useful. Now, instead of trying to squeeze low level cache control into future software, go ahead and implement this (or a similar / better) algorithm into the hardware.

This shouldn't be about turning cache management over to software, but rather about making the hardware cache management better.
2 0 [Posted by: sanity  | Date: 09/19/13 04:00:48 PM]

The problem is that not all algorithms can be efficiently converted to hardware, nor does the hardware know what the OS is doing. It is just too abstract to implement efficiently in hardware.

What could be done is a special-purpose core that manages the cache and memory, but it would be commanded by the OS, because the OS knows which task is running on which core.
1 0 [Posted by: massau  | Date: 09/19/13 05:59:34 PM]

