A New Approach to Integration
With the introduction of the Nehalem microarchitecture, Intel took real steps towards increasing the level of integration in their processors. They began moving into the CPU functional units that had traditionally been the prerogative of the system chipset: the memory controller, the PCI Express controller, and the graphics core. The CPUs also received an L3 cache. In other words, the CPU turned from a purely computational device into a collection of numerous complex functional units.
Of course, this integration has a lot of benefits: it improves performance thanks to shorter wait times during data transfers. However, the more units there are inside a CPU, the more difficult it is to connect them all electrically. The hardest task is the connection between the shared L3 cache and the processor cores, especially since the number of cores tends to grow over time. In other words, when the developers were working on the Sandy Bridge microarchitecture, they had to give serious thought to a convenient way of connecting all functional units inside the processor. The common crossbar interconnect used previously worked fine in dual-, quad- and six-core Nehalem processors, but it does not suit modular CPUs with a large number of different units inside.
In fact, Intel had already addressed this in the eight-core server Nehalem-EX processors, which used an entirely new interconnect to link the computational cores with the L3 cache. This technology, called the Ring Bus, has successfully migrated to the new Sandy Bridge microarchitecture. All the computational cores, the cache, the graphics core and the North Bridge elements inside the new processors are connected via a ring bus running a QPI-like protocol, which significantly reduces the number of internal connections needed for signal routing.
To connect the processor's functional units to the L3 cache via the ring bus, the L3 cache of Sandy Bridge processors is divided into equal banks of 2 MB each. The original design implies that the number of banks equals the number of processor cores; however, for marketing reasons some banks can be disconnected from the bus without any damage to the cache's integrity, thus reducing the cache size. Each cache bank is managed by its own arbiter, but all of them work closely together: the data in them is never duplicated. The banks do not fragment the L3 cache; rather, they increase its bandwidth, which scales with the growing number of cores and, accordingly, banks. For example, since the "ring" is 32 bytes wide, the peak L3 cache bandwidth inside a quad-core CPU working at 3.4 GHz is 435.2 GB/s.
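The peak-bandwidth figure quoted above follows from simple arithmetic: each ring stop can move 32 bytes per clock cycle, and the aggregate figure scales with the number of stops. A minimal sketch of that calculation (the four-stop, 3.4 GHz configuration is the quad-core example from the text):

```python
# Peak L3 ring-bus bandwidth: a 32-byte ring, one transfer per clock per stop.
RING_WIDTH_BYTES = 32   # width of the ring interconnect
CLOCK_HZ = 3.4e9        # quad-core example from the text, 3.4 GHz
STOPS = 4               # one cache bank (ring stop) per core

per_stop = RING_WIDTH_BYTES * CLOCK_HZ / 1e9  # GB/s at a single stop
total = per_stop * STOPS                      # aggregate peak bandwidth

print(f"{per_stop:.1f} GB/s per stop, {total:.1f} GB/s total")
# 108.8 GB/s per stop, 435.2 GB/s total
```

This also shows why the bandwidth scales with the core count: adding a core adds a bank, and with it another ring stop contributing its 108.8 GB/s.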
Scalability with the number of processor cores is not the only advantage of the ring bus. The latency of the L3 cache has also gone down, since data transfers along the "ring" take the shortest route. The L3 cache latency is now 26-31 clock cycles, whereas the L3 cache in Nehalem processors had a latency of 35-40 clocks. Keep in mind, however, that the entire cache in Sandy Bridge runs at the processor frequency, which is another reason it has become faster.
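The "shortest route" property is easy to illustrate with a toy model (this is only an illustration of bidirectional ring routing in general, not Intel's actual routing logic): on a ring with N stops, a packet travels in whichever direction requires fewer hops, so no transfer ever needs more than N/2 hops.

```python
# Toy model of shortest-route delivery on a bidirectional ring
# (illustrative only, not Intel's implementation).
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Hops needed when the packet picks the shorter direction."""
    d = (dst - src) % stops
    return min(d, stops - d)

# On a hypothetical 8-stop ring, the worst case is the opposite
# stop: 4 hops instead of the 7 a one-way ring would need.
worst = max(ring_hops(0, d, 8) for d in range(8))
print(worst)  # 4
```

Because the worst-case distance grows as N/2 rather than N, the average number of hops, and hence the latency, stays low even as more cores and cache banks are attached to the ring.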
Another advantage of the ring bus is that the graphics core integrated into the processor is attached to the same data-transfer routes. This means the graphics core in Sandy Bridge does not work with memory directly but, like the processor cores, goes through the L3 cache. This makes it faster and also eliminates the negative effect on overall system performance that would otherwise occur when the graphics core competes with the processor cores for memory bus bandwidth.