Flashforward: Intel’s 48-Core Chip Unleashed

Intel’s 48-core single-chip cloud computer (SCC), unveiled just about half a year ago, has quickly attracted a lot of attention, not only as the world’s first x86 processor with 48 processing engines, but also as a potential successor to the ill-fated Larrabee. Today we are speaking with one of the co-designers of the SCC to find out more about this ambitious project that is not meant to become a product. In addition, independent expert Jon Peddie tells us about the future of CPUs and GPUs.

by Anton Shilov
05/30/2010 | 08:06 AM

Intel recently demonstrated a system powered by the 48-core processor and also began to supply select partners with such machines, with one simple purpose: to promote multi-core programming research and to determine future directions of microprocessor development. The findings of the research project may eventually revolutionize the whole approach to hardware and software development.

 

In this interview we are talking with Sebastian Steibl, one of the developers of the single-chip cloud computer, to find out more about the SCC, the goals of the project and the future of computing. To get an expert and unbiased view of the future of processing in general, we also asked several questions of Jon Peddie, the head of Jon Peddie Research.

X-bit labs: Hello, can you introduce yourself and briefly describe what you do at Intel?

Sebastian Steibl: My name is Sebastian Steibl and I am the director of Intel Labs Braunschweig, which is a part of the global Intel Labs organization headed by Intel’s chief technology officer Justin Rattner. I was one of the two design managers for the single chip cloud computer (SCC) and was involved in the design of both the silicon and the platform itself. The SCC is a research chip: it does not come from the product groups and is not intended to be a product. It was designed by a small team of researchers to foster many-core programming research as well as certain angles of microprocessor research.

In fact, Intel Labs Braunschweig is a part of the Intel Labs Europe organization, which has twenty labs in the EU and around 900 researchers. The institution is rather big by itself (startup companies have tens of employees, not hundreds), and considering the research nature of the group, it has huge potential.

I am back – Pentium

X-bit labs: There are talks that the Intel SCC has 48 P54C cores, the Pentium cores. Is this information correct, or are the cores inside the SCC more advanced than those that were meant to find their place inside the Larrabee graphics processor?

Sebastian Steibl: We call them Pentium-class cores. Their characteristics are ‘small core with a short in-order pipeline’, which is identical to the P54C and very comparable to the original Pentium.

X-bit labs: Why did you decide to use such old cores?

Sebastian Steibl: The main goal of the SCC was to build a research vehicle that could fit the highest number of cores possible on the given die. We had to make trade-offs between per-core performance and area: we could have taken more powerful cores, but then we would not have been able to put as many of them on the chip. […] As we have found, the Pentium-class core is a sweet spot in terms of programming tools support. The Pentium is still supported by all modern programming tools. If you go with smaller cores [and a less advanced micro-architecture], then support from standard compilers will be [more than] limited. This is the reason why we decided to use the Pentium: it is well supported by world-class development tools, it is fully IA-compatible and it is small, so you can put many of them onto the die.

X-bit labs: How many transistors are there inside the Intel SCC?

Sebastian Steibl: Around 1.3 billion.

X-bit labs: How high is the projected performance of the SCC in terms of GFLOPS/TFLOPS?

Sebastian Steibl: Absolute performance was not a design target. Pentium cores without vector FP units [deliver] floating-point throughput that is not that high. Hence, we aimed to increase the parallelism within the chip rather than to achieve the maximum number of TFLOPS. If we had put additional FP vector units [like those inside Larrabee] into the SCC, we would have needed more area and, even more importantly, more power.

Since we wanted to build a part that is as parallel as possible to advance many-core software research, we took the decision to not go for high floating point (FP) performance.

Many-Core – the Evolution

Although Intel is investing huge amounts of money into new micro-architectures, recent processors from the chip giant – Atom, Core 2 – are based on relatively old, but very well refined P5/Pentium and P6/Pentium Pro micro-architectures. Moreover, even the ill-fated Larrabee graphics processor was supposed to be powered by P54C-derived cores. Logically, it may seem that the world’s largest processor maker believes that having loads of cores is more important than having the best single-threaded core.

But it looks like everything is not that simple: Intel still sees single-thread performance as a target and treats many-core architectures as a research direction.

X-bit labs: Do you think that future multi-core processors will sport many simple cores rather than a limited number of advanced cores, as they do today? Or will they contain as many advanced cores as possible?

Sebastian Steibl: Being a research person, I cannot comment on decisions of product groups. I can only comment on the trends we see. If you look at today’s software, even today’s multi-core software, a lot of things still need [maximum] single-thread performance. There are applications that benefit from single-thread performance, but there are [also] applications [which take advantage of] multi-thread performance.

There are implications for future processors. Single-thread performance will continue to be a requirement for the foreseeable future. The software industry at large still needs tools that are better prepared to exploit parallelism, and this is one of the reasons why we have decided to give the SCC to interested academic institutions to advance methods and productivity enhancements for [multi-core] programming research.

For the foreseeable future large cores will play an important role.

There are many ways to boost the performance of microprocessors. One way is to implement new instructions, as in the case of Intel AVX. Another is to implement special-purpose accelerators, as in the case of AMD’s Fusion program. But which way is better?

X-bit labs: What is more progressive and economically feasible for high-end processors: to implement new instructions, certain special-purpose accelerators, ALUs like those inside GPUs, or a wide vector processing unit (although the latter two are similar)?

Sebastian Steibl: I think they are similar [approaches]… In high-performance processing we need vector units, which we have been adding [for ages now], and we are getting good results out of it. In the mobile space, accelerators play an important role since mobile computing is becoming more and more dominant. I could see a future that has both of them. We actually have research programs that look into both dimensions.

X-bit labs: Do you think that x86 with accelerators (VLIW vector units, whatever) will be able to rapidly process graphics?

Jon Peddie: Absolutely! Intel's new Sandy Bridge processor, AMD's new Llano processor and its little brother, Ontario, will do just that. We will have the hardware readily available by 2011, [but] the software to exploit it will probably not be available until 2012 or 2013.

X-bit labs: What is the reason why Intel decided to allow software to determine the number of cores to use? Maybe an ultra-threaded dispatch processor – like the one used in GPUs – would be better in terms of efficiency and software complexity?

Sebastian Steibl: The SCC is a research vehicle, and we wanted it to be as experimental a platform as possible. With this architecture, data flow and execution management are handled in software; for a research platform it is much better to have this kind of capability than a fixed-function unit. Maybe a fixed-function [data scheduler] is more efficient, but doing it in software [allows us to] give more flexibility to software organizations.
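To make the “software decides” idea more concrete, here is a minimal sketch of our own (generic Linux/pthreads code, not the SCC toolchain): the application itself picks which core a thread runs on instead of relying on a fixed-function hardware scheduler.

    /* Minimal sketch (not SCC code): the application, not fixed-function
       hardware, decides which core a thread runs on. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("worker running on core %d\n", sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t thread;
        pthread_attr_t attr;
        cpu_set_t cpus;

        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);                /* software decides: use core 2 */

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
        pthread_create(&thread, &attr, worker, NULL);
        pthread_join(thread, NULL);
        return 0;
    }

A fixed-function dispatcher, by contrast, would make that placement decision in hardware, which may be more efficient but leaves researchers with nothing to experiment on.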

X-bit labs: Do you believe that many-core CPUs (a decade or so from now) will be able to process both general data and graphics?

Jon Peddie: If graphics operations can be reduced to geometric functions like tessellation and image functions like shaders, without the need for specialized processors (such as a texture processor, video scaler, or colour lookup tables), then the answer is absolutely ‘yes’. Furthermore, I think it will happen in less than five years and probably as soon as three. The reason for specialized processors is to overcome the processing time of scalar processors. With multi-core CPUs and high clock rates, we can be extravagant with the software load and run general-purpose processors for highly specialized applications.

The Level Two

Although many consider the speed of the level-two cache unimportant in client processors, it is more than important in servers and enterprise systems with many processors. Today’s multi-core chips have hardware level-two (L2) cache coherency, i.e., every core has to see the same data no matter which L2 cache or processor socket it resides in.

But in the case of the SCC, the cores are not cache-coherent, which looks questionable at first glance. For example, when a core needs data from another core on the mesh network that is relatively “far away”, it has to wait until the data is passed from one node to the other. This is another reason for the SCC’s existence: to find programming models that cope with this constraint and minimize data exchange between the nodes.

X-bit labs: Do I understand correctly that hardware L2 cache coherency between the cores was left out in order to reduce the bandwidth demanded from the mesh network? Or do you believe that future multi-core processors will not need L2 cache coherency?

Sebastian Steibl: It really depends on the application – for example, message passing comes with a [significant] overhead – and the data locality. But I think that the lack of coherency simplifies the design of the chip and decreases power consumption of the mesh network [in case of the SCC]. 

X-bit labs: So, the assumption that the future applications will not need cache coherency is not exactly correct?

Sebastian Steibl: The point is that [we wanted to find out] whether we need cache coherency in its current form for such parallel computers. All the architectures today are cache-coherent, so we intentionally decided to build a non-cache-coherent architecture to see how far you can go without hardware cache coherency. [Software developers now] can manage cache coherency in software. The reason we did this is that in supercomputers you are usually not coherent; if there are thousands of nodes in an HPC case, you are not coherent. So, we do know that there is a working, scaling programming model without coherency. The current “on-die” programming model is fully coherent, so we wanted to see if the [HPC] model also works on a large number of cores. It is an active experiment to see whether the lack of hardware coherency is really a limiting factor for parallel software in the case of 50, 100 or even more “nodes”.
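To illustrate the programming model Steibl describes, here is a minimal sketch of our own (plain MPI, the message-passing standard used on non-coherent HPC clusters, not Intel’s SCC libraries): data is moved between “nodes” with explicit messages rather than through hardware-coherent caches.

    /* Minimal MPI sketch (not SCC code): node 1 sees node 0's data only
       after an explicit message, i.e. coherency is managed in software. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42.0;   /* the data lives only in node 0's memory */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node 1 received %.1f\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Whether this style of explicit data exchange remains practical with 50, 100 or more cores on a single die is exactly the question the SCC was built to answer.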

Multifrequency and Hetero

For many years microprocessors were classified by their clock-speed. With the emergence of the multi-core era, frequency became less important. Future central processing units, however, will run different clock-speeds inside a single chip, so a single frequency figure will play an even smaller role.

X-bit labs: Each of the tiles can run at its own clock-speed, yet the mesh network seems to run at a constant clock-speed, whereas the memory controller runs at yet another clock-speed. Will the chips of the future all use different internal clock-speeds for different parts of the microprocessor?

Sebastian Steibl: I am not from the product groups, so I cannot comment on actual products. But we have built this research chip… and I would be surprised if microprocessors stayed at a single clock-speed forever. My personal opinion, as a researcher, is that what you say is true. There are good reasons for staying at the same frequency though, for instance, simpler clocking and power gating. Moreover, there are alternatives [to different clock-speeds] – we can slow parts of the chip down or completely disable a part [in order to reduce power consumption] for certain clock cycles. I think that [eventually] we will see different parts of a chip operating at different performance points, according to the task.

X-bit labs: Perhaps a heterogeneous multi-core approach is better (AMD Llano being one, but not the only, example of such an approach)?

Jon Peddie: Heterogeneous processing is the ideal situation today, with a few caveats: load-balancing applications between scalar processing and vector processing is still very complicated and inefficient. It is only accomplished through explicit instructions in an application. When the operating system, or a resident kernel of an operating system, is able to parse the application's needs and direct the work to the appropriate processor (scalar, vector, matrix, etc.), the efficiencies of an integrated heterogeneous processor will be overwhelming. Hardware has led software by an increasing number of years. In the early 80s, hardware was insufficient for the software. In the early 90s, hardware gained parity with software. In the new millennium, hardware capabilities have been exceeding the demands of software by about six months every other year. There is, it seems, no Moore’s Law for software development.
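As a rough illustration of the “explicit instructions” Peddie refers to, here is a hypothetical sketch of our own (x86 SSE intrinsics, not tied to any particular product): today the application itself has to provide and choose between a scalar and a vector code path; nothing dispatches the work automatically.

    /* Hypothetical sketch: the programmer explicitly writes both a scalar
       and a vector (SSE) path; no OS or hardware picks between them. */
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* plain scalar loop */
    static void add_scalar(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* explicitly vectorized loop: four floats per SSE instruction */
    static void add_sse(const float *a, const float *b, float *out, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)       /* leftover elements go the scalar way */
            out[i] = a[i] + b[i];
    }

In the scenario Peddie outlines, the operating system, not the programmer, would decide which of the two paths (or which processor) gets the work.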

X-bit labs: What do you think about AMD's (and eventually Intel's) "Fusion" approach? Will it work out? It is already happening in certain markets, though...

Jon Peddie: It not only will work out, it is inevitable and essential. It is inefficient to physically separate scalar and vector processors. The advantages of inter-processor communication via an L3 cache [and additional logic] are too compelling to ignore. With the new process nodes (32nm and smaller), the construction of these ultra-complex machines becomes economically feasible.

X-bit labs: How did you manage to squeeze 48 cores along with additional logic into a 125W TDP? It is a remarkable achievement, by the way.

Sebastian Steibl: We have certain abilities to aggressively manage power consumption: different voltage and frequency domains are present within the SCC. But we also had to make a number of design trade-offs.

Memory of the Future

X-bit labs: Intel once said that the SCC packs an energy-efficient quad-channel DDR3 memory controller. Can you provide any additional information about this? What is the difference between this memory controller and those used today?

Sebastian Steibl: The memory controllers are optimized for different operating points. One memory controller has to serve twelve cores, which is quite a high number in terms of parallelism. Hence, the memory controllers are optimized to serve many independent single-threaded streams rather than a smaller number of threads. The memory controller is optimized for more parallel accesses, whereas traditional memory controllers are optimized more for performance with a limited number of threads.

X-bit labs: So, that is the memory controller of the future?

Sebastian Steibl: It is a memory controller suitable for a many-core architecture. [It makes] good sense for “large” cores as well.

X-bit labs: What was the response from the software development community to the SCC? Have you already shipped those systems to software makers?

Sebastian Steibl: We have started to share [SCC-based systems] with a very select group of external partners. There are some interesting results coming out, and I guess some pretty interesting news will arrive over the summer. It is up to those parties to comment on the work [that has been done].

Time Will Tell

There are a million ways to increase computing performance. One is to build larger general-purpose cores, another is to incorporate many cores into a chip, yet another is to produce a many-core heterogeneous solution. But there is a problem: the software has to be ready for the hardware. With the SCC program, Intel at least partly offers a solution to that problem. What happens next? Only time will tell.