The Level Two
Although many consider the speed of the level-two cache in client processors unimportant, it matters a great deal in servers and enterprise systems with many processors. Today’s multi-core chips have hardware level-two (L2) cache coherency, i.e., every core must see the same data in L2 no matter which processor socket it sits in.
But in the case of the SCC, the cores are not cache-coherent, which looks questionable at first glance. For example, when a core needs data from another core that is within the mesh network but relatively “far away”, the whole system waits until one core/node gets the data from the other. This is another reason for the SCC's existence: to find programming models that take the aforementioned problem into account and minimize data exchange between the nodes.
X-bit labs: Do I understand it correctly that the lack of hardware L2 cache coherency between the cores was implemented in order to reduce the amount of bandwidth needed from the mesh network? Or do you believe that future multi-core processors will not need L2 cache coherence?
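To make the notion of “far away” concrete, here is a minimal sketch that assumes a 2D mesh with dimension-ordered (XY) routing, where the hop count between two tiles is simply their Manhattan distance. The 6x4 grid, the row-by-row tile numbering and the function name are illustrative assumptions, not details given in the interview.

```python
def mesh_hops(src, dst, width=6):
    """Hops between two tiles in a 2D mesh, assuming XY routing.

    Tiles are numbered row by row; `width` (tiles per row) is an
    assumed parameter, not a figure taken from the interview.
    """
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    # With XY routing a message travels along X first, then along Y,
    # so the hop count is the Manhattan distance between the tiles.
    return abs(sx - dx) + abs(sy - dy)


if __name__ == "__main__":
    print(mesh_hops(0, 1))    # neighbouring tiles
    print(mesh_hops(0, 23))   # opposite corners of the assumed 6x4 grid
```

Under these assumptions, a programming model that keeps communicating tasks on neighbouring tiles pays for far fewer hops than one that scatters them across the grid.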
Sebastian Steibl: It really depends on the application – for example, message passing comes with a [significant] overhead – and the data locality. But I think that the lack of coherency simplifies the design of the chip and decreases power consumption of the mesh network [in case of the SCC].
X-bit labs: So, the assumption that future applications will not need cache coherency is not exactly correct?
Sebastian Steibl: The point is that [we wanted to find out] whether we need cache coherency in its current form for such parallel computers. All the architectures today are actually cache coherent, so, we intentionally decided to be a non-cache coherent architecture to see how far you can go without hardware cache coherency. [Software developers now] can manage the cache coherency in software. The reason we have this is because in super-computers you are usually not coherent; if there are thousands of nodes in an HPC case, you are not coherent. So, we do know that there is a working, scalable programming model without coherency. The current “on-die” programming model is fully coherent, so, we wanted to see if the [HPC] model also works on a large number of cores. It is an active experiment to see whether the lack of hardware coherency is really a limiting factor for parallel software in case of 50, 100 or even more “nodes”.
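What “managing the cache coherency in software” can look like is sketched below as a toy model: each core has a private cache over a shared memory, and data only becomes visible to another core after an explicit flush on the producer side and an explicit invalidate on the consumer side. The `Core` class and its methods are hypothetical illustrations and do not represent the SCC's actual programming interface.

```python
class Core:
    """Toy core with a private, non-coherent cache over a shared memory dict."""

    def __init__(self, shared_memory):
        self.memory = shared_memory
        self.cache = {}
        self.dirty = set()

    def read(self, addr):
        # A cache hit returns the local copy, even if memory has changed since.
        if addr not in self.cache:
            self.cache[addr] = self.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        # Writes stay in the private cache until flushed explicitly.
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Software step 1: the producer writes its dirty data back to memory.
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()

    def invalidate(self, addr):
        # Software step 2: the consumer drops its (possibly stale) copy.
        self.cache.pop(addr, None)
        self.dirty.discard(addr)


def demo():
    memory = {0x10: 1}
    producer, consumer = Core(memory), Core(memory)
    first = consumer.read(0x10)    # consumer caches the old value
    producer.write(0x10, 2)
    producer.flush()
    stale = consumer.read(0x10)    # still old: nothing updates the cache for us
    consumer.invalidate(0x10)
    fresh = consumer.read(0x10)    # now the producer's write is visible
    return first, stale, fresh
```

In this model, forgetting either the flush or the invalidate leaves the consumer reading stale data, which is the burden that hardware coherency normally takes off the programmer.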
Multifrequency and Hetero
For many years microprocessors have been classified by their clock-speed. With the emergence of the multi-core era, frequency became less important. Moreover, future central processing units will run at different clock-speeds internally, so a single frequency figure will no longer play a significant role.
X-bit labs: Each of the tiles can run at its own clock-speed, yet the mesh network seems to run at a constant clock-speed, whereas the memory controller runs at yet another clock-speed. Will the chips of the future all work at different internal clock-speeds for different parts of microprocessors?
Sebastian Steibl: I am not from the product groups, so, I cannot comment on actual products. But we have built this research chip… And I will be surprised if microprocessors stay at the same clock-speed forever. My personal opinion, as a researcher, is that what you say is true. There are good reasons for staying at the same frequency though, for instance, clocking power-gates. Moreover, there are alternatives [to different clock-speeds] – we can slow parts of the chips down or completely disable a part [in order to reduce power consumption] for certain clock cycles. I think that [eventually] we will see different parts of a chip operating at different performance points, according to the task.
X-bit labs: Perhaps a heterogeneous multi-core approach is better? (AMD Llano is one example of such an approach, though not the only one.)
Jon Peddie: Heterogeneous processing is here today and is the most ideal situation, with a few caveats: load-balancing of the applications (the need for scalar processing and vector processing) is still very complicated and inefficient. It is only accomplished through explicit instructions in an application. When the operating system, or a resident kernel of an operating system, is able to parse the application's needs and direct the work to the appropriate processor (scalar, vector, matrix, etc.) the efficiencies of an integrated heterogeneous processor will be overwhelming. Hardware has led software by an increasing number of years. In the early 80s, hardware was insufficient for the software. In the early 90s, hardware gained parity with software. In the new millennium hardware capabilities have been exceeding the demands of software by about six months every other year. There is, it seems, no Moore's law for software development.
X-bit labs: What do you think about AMD's (and eventually Intel's) "Fusion" approach? Will it work out? It is already happening in certain markets, though...
Jon Peddie: It not only will work out, it is inevitable, and essential. It is inefficient to physically separate scalar and vector processors. The advantages of inter-processor communication via an L3 cache [and additional logic] are too compelling to be ignored. With the new process nodes (32nm and smaller), the construction of these ultra-complex machines is economically feasible.
X-bit labs: How did you manage to squeeze 48 cores along with additional logic into a 125W TDP? It is a remarkable achievement, by the way.
Sebastian Steibl: We have certain abilities to aggressively manage power consumption: different voltage and frequency domains are present within the SCC. But we also had to make a number of design trade-offs.
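Why separate voltage and frequency domains help so much can be illustrated with the textbook CMOS switching-power estimate, P = C · V² · f. The sketch below is only a rough model under that assumption: it ignores leakage, and the capacitance, voltage and frequency values used with it are hypothetical placeholders rather than SCC figures.

```python
def dynamic_power(capacitance, voltage, frequency_hz):
    """Textbook CMOS switching-power estimate: P = C * V^2 * f (leakage ignored)."""
    return capacitance * voltage ** 2 * frequency_hz


def chip_power(domains):
    """Sum the estimate over (capacitance, voltage, frequency_hz) tuples,
    one tuple per independently controlled domain."""
    return sum(dynamic_power(c, v, f) for c, v, f in domains)


if __name__ == "__main__":
    # Hypothetical numbers: four identical domains, all at full speed...
    full = chip_power([(1e-9, 1.0, 1e9)] * 4)
    # ...versus two of them dropped to half the voltage and half the frequency.
    mixed = chip_power([(1e-9, 1.0, 1e9)] * 2 + [(1e-9, 0.5, 0.5e9)] * 2)
    print(full, mixed)
```

Because voltage enters squared, a domain running at half the voltage and half the frequency draws roughly one eighth of the switching power in this model, which is why slowing down lightly loaded parts of a chip pays off disproportionately.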