by Yury Malich
05/31/2006 | 01:49 PM
The year 2005 marked the beginning of the use of dual-core processors in desktop computers. Such a processor places two ordinary processor cores, each with all of its resources including L1 cache, on a single silicon die. L2 cache memory may be independent for each core or shared between them. A memory bus controller, inter-core communication controller, crossbar switch, etc. can also be located on the same die. Numerous tests prove the advantages of dual-core processors over single-core ones in applications that support multi-threading. But there seem to have been no tests to show at what speed the cores can exchange data.
To better understand what this review is all about, you should be aware of the problems arising in communication between processors of a multiprocessor system.
The processors are working with data that are read from system memory to be modified and then written back. Data are cached in the CPU for faster processing, but more than one processor may request the same data in a multiprocessor system. This is not a problem if both the processors are just reading data, because they are both provided the most recent valid copy from system RAM. But if one of the processors modifies the data, the data are first changed in the cache memory and it is only after a while that they are written into system RAM. So, there is a potential conflict when one processor is trying to read data that have been modified and are currently stored in another processor’s cache.
Methods to solve conflicts of this kind are referred to as cache coherency protocols. There are many varieties of such protocols, but describing them is beyond the scope of this review (you can refer to documentation available on the CPU manufacturers’ websites for details).
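The conflict described above can be reproduced with a toy model. The sketch below is only an illustration: the `Core` class, its method names and the `snoop` switch are hypothetical, and the "write back on probe" rule is one simple way to keep caches coherent, not the protocol of any particular processor.

```python
class Core:
    """A core with a private write-back cache (toy model, not a real protocol)."""

    def __init__(self, ram):
        self.ram = ram          # shared "system RAM": address -> value
        self.cache = {}         # private cache: address -> value
        self.dirty = set()      # addresses modified but not yet written back

    def read(self, addr, peers=(), snoop=True):
        if addr not in self.cache:
            if snoop:
                # Ask the other cores to write back a modified copy first.
                for peer in peers:
                    peer.write_back(addr)
            self.cache[addr] = self.ram[addr]
        return self.cache[addr]

    def write(self, addr, value):
        # Write-back policy: the change stays in the cache for now.
        self.cache[addr] = value
        self.dirty.add(addr)

    def write_back(self, addr):
        if addr in self.dirty:
            self.ram[addr] = self.cache[addr]
            self.dirty.discard(addr)


def demo(snoop):
    """First core modifies a value; return what the second core then reads."""
    ram = {0x40: 1}
    first, second = Core(ram), Core(ram)
    first.read(0x40)
    first.write(0x40, 2)
    return second.read(0x40, peers=[first], snoop=snoop)
```

Without the probe, `demo(False)` returns the stale value still sitting in RAM. With it, `demo(True)` returns the value the first core wrote.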
Cache coherency protocols usually transfer modified data between the processors via system memory, but the cores of a dual-core processor are located on the same die right next to each other, so there’s an opportunity for a direct transfer of data from one core’s cache into another’s. Such transfers might go at a very fast rate. Some PC reviewers even suppose that the total effective capacity of the caches of two separate cores can be compared to a single common cache of the same capacity. Is this supposition true? We are going to answer this question in this review.
In order to measure the speed of access to another core’s cache we developed an algorithm in which two execution threads are working with a common data block. The test utility creates two threads, each of which is assigned to one logical processor. One thread works with a data block in memory in two modes: read-only (the data block is just transferred from system RAM into cache) and read-and-modify (the data are modified after load). Thus we can check two possible cases: the second core reads data that the first core has loaded without changing, and the second core reads data that the first core has modified in its cache.
Having finished with the data block, the first thread hands control to the second thread, which has been waiting for the signal, and suspends itself. The second thread reads the same data block and measures the speed.
The data may get cached by the CPU core when being read by the second thread, so the ordinary method of increasing the accuracy of read speed measurements by running the test many times on the same memory area doesn’t work. Instead, the second thread clears the loaded data from its cache by applying the CLFLUSH instruction to each cache line that holds them and hands control back to the first thread for another data load.
Throughout this review I will refer to the core that runs the first thread (which loads the data block from memory into cache) as the first core and to the core that runs the speed-measuring thread as the second core.
The threads are synchronized by means of so-called spin-wait loops (a simple loop that looks like while(!signal), in which the variable signal is modified from outside by the other thread) in both threads. These loops ensure the maximum speed of handing control over to one thread on a signal from the other and also keep the thread assigned to the same processor, preventing other threads from running on it and changing the cache contents. Both threads are running at the highest priority (THREAD_PRIORITY_REALTIME).
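The hand-off between the two threads can be sketched as follows. This is a minimal model of the spin-wait idea only, not the test utility's code: the `run_handoff` function, the `turn` dictionary and the action labels are hypothetical names, and a native implementation would spin without yielding and would also set thread priority and affinity.

```python
import threading
import time


def run_handoff(rounds=3):
    """Two threads alternate strictly, each spinning on a shared signal."""
    turn = {"owner": 1}   # hypothetical shared signal: which thread may run
    log = []

    def worker(me, other, action):
        for i in range(rounds):
            while turn["owner"] != me:   # spin-wait until signalled
                time.sleep(0)            # yield; a native loop would just spin
            log.append((action, i))      # stand-in for "load" or "measure" work
            turn["owner"] = other        # hand control to the other thread

    first = threading.Thread(target=worker, args=(1, 2, "load"))
    second = threading.Thread(target=worker, args=(2, 1, "measure"))
    second.start()
    first.start()
    first.join()
    second.join()
    return log
```

Each "load" step is always followed by exactly one "measure" step, which is the ordering the test relies on.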
To measure the data read latency in the second thread, we use a chain of linked pointers interspersed with commands to delay the use of the read data:
// eax is the beginning of the data block
xor ebx, ebx          // ebx <= 0
xor edx, edx          // edx <= 0
………
and edx, eax          // synchronization
mov eax, [eax+ebx]    // a data read
N*{and ebx, edx}      // delay to use the read data
and edx, eax          // syncing the moment of use of the read data
mov eax, [eax+ebx]    // using the read data; the next data read
N*{and ebx, edx}
and edx, eax
mov eax, [eax+ebx]
………
Here, N is the number of delay commands. Delays help adjust to such effects of speculative execution as replays and analyze how the data transfer speed depends on the length of the chain. This will give us a better understanding of the characteristics of the memory subsystem. The sync command ensures the set order of delay commands preventing the processor from reordering the read commands in the chain.
We run the test with data blocks of 8KB to 4094MB, stepping through the powers of two. This covers a wide range of cases, from data blocks that completely fit into L1 cache to blocks that do not fit into the cache of any of the tested processors. Within the block, data are processed with a step of 64 bytes.
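A chain of linked pointers with a 64-byte step can be laid out for sequential or random traversal. The sketch below shows one possible construction. It is not the test utility's code: `build_chain`, `walk`, the `shuffle` flag and the `seed` parameter are hypothetical, and byte offsets stand in for real pointers.

```python
import random

LINE = 64  # step between chain elements in bytes, as stated in the article


def build_chain(block_bytes, shuffle=False, seed=0):
    """Map each 64-byte-aligned offset to the offset that is read next."""
    order = list(range(0, block_bytes, LINE))
    if shuffle:
        random.Random(seed).shuffle(order)   # random-access variant
    chain = {}
    # Link every element to its successor and close the loop at the end.
    for here, there in zip(order, order[1:] + order[:1]):
        chain[here] = there
    return chain


def walk(chain, start=0):
    """Follow the chain once around; return visited offsets and the end point."""
    visited = []
    pos = start
    for _ in range(len(chain)):
        visited.append(pos)
        pos = chain[pos]      # each read depends on the previous one
    return visited, pos
```

Because the chain is a single closed loop, one walk touches every 64-byte line of the block exactly once, whichever order is chosen.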
The first processor to be tested in this review is the dual-core AMD Athlon 64 X2 with a rating of 4400+, a clock rate of 2200MHz and 1MB of cache memory on each execution core. The two cores not only reside on the same silicon die but are also connected to the crossbar switch via a system request interface. All requests for data in system memory pass through the switch, so we can expect that the cores communicate with each other without the mediation of the system or memory bus.

We’ll set off with the case when the first thread reads data without modifying them. That is, the same copy of data is stored in the system memory and in the cache of the first-thread processor.

Pic.1: AMD Athlon 64 X2. Sequential reading of non-modified data
loaded into the cache of the other core.

Pic.2: AMD Athlon 64 X2. Random reading of non-modified data
loaded into the cache of the other core.
The diagrams show how the average data read latency depends on the length of the delay chain. The first graph (Picture 1) shows that the minimum latency at sequential access almost doesn’t depend on the size of the data block and is about 50 cycles irrespective of the expected data location. It means the data are always taken from system memory. Those 50 cycles shouldn’t be regarded as memory latency. This is the average time between a data request and data load considering the hardware prefetch.
The diagrams look more interesting at random access (Picture 2). The minimum latency is different for different data block sizes and is smaller for smaller blocks. However, a latency of over 80 cycles still looks too big for cache access, even for access to another core’s cache, and the latency of access to data blocks that fully fit into one level of cache memory shouldn’t vary that much. This makes me suspect an influence of the prefetch mechanism again. I’ll check this later on.
You can also see a step-like increase of latency as the chain of delay commands becomes longer, noticeable from the 16KB data block and also characteristic of the 4MB block, where a cache hit is out of the question. The step is 10 cycles high. The frequency multiplier of this CPU is 10, too. Could the step-like shape of the graph have any relation to the rate at which data are coming in from external (from the CPU’s point of view) memory? I’ll check this out, too.
So, the results of reading unmodified data make me think that there’s no fast reading directly from another core’s cache. This must be how the MOESI protocol is implemented here: it requests the most recent copy of data from system memory. Let’s see if we have more luck with modified data the valid copy of which is stored in the first core’s cache.

Pic.3: AMD Athlon 64 X2. Sequential reading of the data
modified in the cache of the other core.

Pic.4: AMD Athlon 64 X2. Random reading of the data
modified in the cache of the other core.
The results aren’t encouraging. The data transfer latency has become a little higher, but the overall picture has remained the same. The second thread’s data access latency is too high for this thread to be possibly reading directly from the first core’s cache. When randomly reading the modified data (Picture 4), there’s a small growth of data transfer latency for data blocks smaller than 512MB which may be due to the necessity to copy the modified cache lines into system RAM. The growth is very small, though, and there is no such latency growth when the data are accessed sequentially. This probably means that the memory controller doesn’t access the data after having just written them to memory, but returns them to the processor from the internal buffers, which is in fact right.
To make sure the data are read from the system bus, I’ll carry out a couple more tests.
First I reduce the CPU frequency multiplier from 10 to 6 which results in a CPU clock rate of 1200MHz. And here are the results of reading the modified data:

Pic.5: AMD Athlon 64 X2, 1200MHz frequency. Random reading of the data
modified in the cache of the other core.
The steps in the graph are 6 cycles high now. This clearly indicates that data transfers into the core are synchronized with the CPU memory bus and are most likely performed through this very bus.
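The arithmetic behind this conclusion is short. In the sketch below, the function names are hypothetical and the figures are the ones quoted for this run (1200MHz, multiplier 6, steps of 6 cycles). The only assumption is that the multiplier relates the core clock to a single reference bus clock.

```python
def bus_clock_mhz(core_mhz, multiplier):
    """Reference clock implied by a core clock and its frequency multiplier."""
    return core_mhz / multiplier


def step_ns(step_cycles, core_mhz):
    """Duration of a latency step, converted from core cycles to nanoseconds."""
    return step_cycles * 1000.0 / core_mhz


# Figures from the 1200MHz run: multiplier 6, steps 6 cycles high.
reference = bus_clock_mhz(1200, 6)   # reference clock in MHz
step = step_ns(6, 1200)              # step duration in ns
period = 1000.0 / reference          # one reference clock period in ns
```

With these numbers, a step of 6 core cycles lasts exactly one period of the reference clock, which is what one would expect if incoming data are aligned to bus clock edges.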
And now I’m going to check my supposition that the data are read from system RAM rather than from somewhere else by measuring the speed of reading data from it. To accomplish this, I just disabled data reads from the first thread. Thus, it’s only the second thread that works with data: it reads the data from system RAM and measures the latency, then clears the cache, reads the data again and clears the cache again.

Pic.6: AMD Athlon 64 X2. Sequential reading of the data from system RAM.

Pic.7: AMD Athlon 64 X2. Random reading of the data from the system RAM.
The graphs for reading the unmodified data loaded into the first core (Pictures 1 and 2) and for reading the data from system RAM (Pictures 6 and 7) almost coincide. So, the supposition that the hardware data prefetch mechanism affects the random-access latency is correct! The only differences are in the latency graphs of 8-16KB data blocks, but they are due to the hardware data prefetch mechanism, as we can verify by running the tests multiple times.
So, I have to state that I can’t find any indication of direct data transfers from one execution core to another in the Athlon 64 X2 processor. According to my tests, the most recent copy of data is always read from system RAM. This must be a limitation of the MOESI protocol implementation. The following seems to happen when data are accessed: on receiving the read probe that the second core puts on the system bus, the first core performs a write-back of the modified cache line into memory. After this write, or simultaneously with it, the requested line is transferred to the second core. If the data in the first core’s cache haven’t been modified, they are read from system RAM. Why is there no direct transfer between the cores via the crossbar switch? Ask AMD’s engineers about that! :)
I think many readers would be interested in looking at the performance of a dual-processor platform in this test. Before proceeding to Intel’s processors, let’s have a look at the results of a system with two Opteron 254 processors clocked at 2800MHz.

Pic.8: Dual-processor AMD Opteron system. Sequential reading of
non-modified data loaded into the cache of the other core.

Pic.9: Dual-processor AMD Opteron system. Random reading of
non-modified data loaded into the cache of the other core.

Pic.10: Dual-processor AMD Opteron system. Sequential reading of the data
modified in the cache of the other core.

Pic.11: Dual-processor AMD Opteron system. Random reading of the data
modified in the cache of the other core.
The results are similar to those of the Athlon 64 X2, but the latencies are bigger, especially at random access. The anomalous distribution of latencies at sequential access (higher latencies are observed with smaller data blocks) is probably due to the overhead for data transfers over the HyperTransport bus which worsens the average result when small data blocks are processed. It is clear that a dual-core processor is more efficient than two single-core ones when common data are processed.
The first processor from Intel in this review is the Pentium D 920 on the Presler core. The processor has a clock rate of 2800MHz and two cores with 2MB of L2 cache each. Unlike the cores of the Athlon 64 X2, which are connected to a single crossbar switch that all requests to the system and memory buses pass through, the cores of the Pentium D are connected to the common FSB in a simpler way, via a shared bus interface. So, having examined the speed of data transfers between the cores of the Athlon 64 X2, I do not expect high speed from the Pentium D. But let’s look at the results:

Pic.12: Intel Pentium D. Sequential reading of non-modified data
loaded into the cache of the other core.

Pic.13: Intel Pentium D. Random reading of non-modified data
loaded into the cache of the other core.

Pic.14: Intel Pentium D. Sequential reading of the data
modified in the cache of the other core.

Pic.15: Intel Pentium D. Random reading of the data
modified in the cache of the other core.
The graphs have something in common with the graphs of the Athlon 64 X2, but there are some micro-architectural differences. The Athlon’s step-like manner of random memory access has changed into a wave-like one (Picture 13, 15). The wave is 18 cycles long which is exactly the length of the replay loop CPUs with the NetBurst architecture use to restart a chain of commands in case of cache misses or errors of speculative execution.
At sequential access to the modified data, there’s a considerable and identical increase of latency for each data block of 2MB and smaller. To be exact, it is the read speed that’s lower, whereas the speed of reading the 4MB data block doesn’t drop that much. When the data are accessed at random, it is exactly the opposite – the latency of access to the modified data is lower! It looks like the modified data copied from the first core to system RAM can be transferred by the memory controller into the second core out of the intermediary buffers at the same time as they are written into the memory chips. This helps save some cycles at random reading. At sequential access, the speed of data transfers into the second core is still limited by the speed at which the first core writes them into memory, because writing data into system RAM takes more time than reading.
Now let’s check out if the data are actually read from system RAM. I’ll measure the speed of data reads from system RAM as I did with the Athlon 64 X2.

Pic.16: Intel Pentium D. Sequential reading of the data from system RAM.

Pic.17: Intel Pentium D. Random reading of the data from the system RAM.
The results aren’t too obvious. On one hand, the latency is higher with 64MB and smaller blocks in comparison with the graphs of reading the unmodified data loaded into the first core (Picture 12, 13). The latency for 8KB and 16KB data blocks has grown the most, which corresponds to the size of the L1 data cache of this Pentium D processor. But on the other hand, the latency of reading small blocks has grown in the sequential read graphs (Picture 16), which wouldn’t be the case if the data were transferred directly from one core to another. The read latencies for 128KB and larger data blocks almost coincide, which means the data are read from system RAM rather than from the other core’s cache. The increase of latency that we observe here may be due to hardware prefetch or to the test algorithm (for example, the data-transfer overhead, which had been previously masked by the operation of the first core, is bigger relative to the real latency).
A Pentium 4 on the Prescott core with a clock rate of 3800MHz and 2MB of L2 cache is going to be tested for comparison’s sake, too. It is a single-core processor that supports Hyper-Threading technology. You may be interested in the results this processor will show with the algorithm we use in this test session. What surprises can there be? The two virtual processors of this Pentium 4 are physically the same core with the same L1 and L2 caches. It means that common data are processed faster by both threads. The results of reading the data from the cache are almost identical irrespective of their validity (modified or unmodified), so I will only publish two graphs – the reading of the modified data:

Pic.18: Intel Pentium 4 + HT. Sequential reading of non-modified data.

Pic.19: Intel Pentium 4 + HT. Random reading of non-modified data.
No surprises here. There’s a latency of 4 cycles when reading data blocks that fit into the L1 data cache which corresponds to the latency of this cache. A latency of 22 cycles, corresponding to the L2 cache latency, is observed at sequential reading of data blocks up to 2048KB and at random reading of blocks up to 256KB. The latency minimum at the delay chain length of 18 cycles is the same as the length of the replay loop. The latency growth at random reading of 512KB and bigger data blocks is due to the TLB size limitation (64 entries which are sufficient for only 256KB of memory). Otherwise the results are just as they should be for two threads running on a processor with a common cache.
The Core Duo T2400 processor from Intel will be tested next. This dual-core processor has a clock rate of 1833MHz and is based on the Yonah core. The main difference of this CPU from those I’ve tested earlier in this review is that its 2 megabytes of cache memory is shared between the two execution cores.

With a shared L2 cache, the data read by the first core into the cache should be “visible” to the other core. This means we can hope to get good results here. Let’s see…

Pic.20: Intel Core Duo (Yonah). Sequential reading of non-modified data
loaded into the cache of the other core.

Pic.21: Intel Core Duo (Yonah). Random reading of non-modified data
loaded into the cache of the other core.
There’s a latency of 14 cycles when reading 1MB and smaller blocks of the unmodified data. This is exactly the latency of this processor’s L2 cache. You may ask why there’s a sudden increase of latency on the 2MB data block if the processor has 2MB of L2 cache. It’s because of the size of this processor’s TLB, which is 256 entries for 4KB pages. That is, the TLB can cover only 1024KB of memory, and when new pages are accessed, a time-costly access to the page translation tables must be performed to translate virtual addresses into physical ones. This has a negative effect on the result, of course. So, we’ve got excellent performance when reading the unmodified data. Let’s see how good the processor is at reading the modified data.
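The TLB figures quoted in this review follow from a simple product of entry count and page size. The sketch below restates that arithmetic. The function names are hypothetical, and the entry counts (64 for the Pentium 4, 256 for the Core Duo) and the 4KB page size are the ones given in the text.

```python
def tlb_reach_kb(entries, page_kb=4):
    """Amount of memory (in KB) that the TLB can map without a miss."""
    return entries * page_kb


def pages_touched(block_kb, page_kb=4):
    """Number of pages a contiguous, page-aligned block spans (rounded up)."""
    return -(-block_kb // page_kb)


def fits_in_tlb(block_kb, entries, page_kb=4):
    """True if every page of the block can hold a TLB entry at once."""
    return pages_touched(block_kb, page_kb) <= entries


pentium4_reach = tlb_reach_kb(64)     # 64 entries of 4KB pages
core_duo_reach = tlb_reach_kb(256)    # 256 entries of 4KB pages
```

A 1MB block spans 256 pages and still fits a 256-entry TLB, while a 2MB block spans 512 pages and does not. This matches the point at which the latency jumps.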

Pic.22: Intel Core Duo (Yonah). Sequential reading of the data
loaded and modified in the other core.

Pic.23: Intel Core Duo (Yonah). Random reading of the data
loaded and modified in the other core.
The results are rather ambiguous, but let’s try to understand them. The latency is the lowest for the 1MB data block and grows for smaller data blocks, reaching a maximum on the data blocks that fully fit into the L1 cache memory (32KB). Is there something wrong with the test? No. You can notice that the graphs for small data blocks that fit into the L1 cache look very much like the latency graphs of the Athlon 64 X2 processor I tested at the beginning of the review. The characteristic step-like shape of the graphs is quite obvious. The steps are 11 cycles high, which is exactly the value of the CPU frequency multiplier. Hence a surprising conclusion: the modified data in the other core’s cache are accessed here in the same way as in the Athlon 64 X2 and Pentium D processors. That is, the most recent copy of data is first sent back to system RAM via the system bus and is then transferred into the second core. But why? The L1 caches of the Yonah’s cores use a write-back caching policy, which means that after the data are changed by a core, the modified line with the valid data copy stays in the L1 cache until it is evicted, whereas the L2 cache contains obsolete data. It seems that when a cache miss occurs in the shared L2 cache, the second core places a read request on the system bus, and the first core (which stores the most recent copy of the data) responds to the read probe from the system bus by sending the data to the bus rather than to the second core. I can’t say why the data are not saved directly into L2 cache for the second core to read them from there. Perhaps it would have taken a considerable redesign of the processor to make it work so.
The last processor to be tested is the newest, not yet officially announced processor from Intel codenamed Conroe. We are going to test an engineering sample of this CPU which has a clock rate of 2400MHz and a shared 4MB cache. Here are the results:

Pic.24: Intel Conroe. Sequential reading of non-modified data loaded into the other core.

Pic.25: Intel Conroe. Random reading of non-modified data loaded into the other core.

Pic.26: Intel Conroe. Sequential reading of the data loaded and modified in the other core.

Pic.27: Intel Conroe. Random reading of the data loaded and modified in the other core.
What strikes the eye immediately is that the graphs are similar to the Yonah’s but with different latencies.
First, the read latency corresponds to the L2 cache latency when reading the unmodified data (Picture 24, 25). Interestingly, the L2 cache latencies differ at sequential and random data reads: 12 cycles at sequential (Picture 24) and about 14 cycles at random reading (Picture 25). I don’t yet know the real latency of the cache; the reasons for this difference may become clear after more tests.
Second, when reading the modified data there is a considerable increase of latency if the data modified by the other core fully fit into its L1 data cache – just like with the Yonah, but the value of the increase is much smaller (Picture 26, 27). The step-like pattern is also observed when the delay chain is long; the steps are 9 cycles high which corresponds to the CPU frequency multiplier (Picture 27).
Now I disable the reading of data by the first thread and check the speed of reading from system RAM.

Pic.28: Intel Conroe. Sequential reading of the data from system RAM.

Pic.29: Intel Conroe. Random reading of the data from the system RAM.
This processor works much more efficiently with system memory as you can see (Picture 28, 29). The higher speed of work with memory explains the lower latencies when reading modified data in another core’s L1 data cache in comparison with the Yonah.
So, the speed of data transfers between the cores is much higher in the Conroe than in the Yonah, but the modified data, if located in L1 data cache, are still transferred using the system bus.
None of the processors with separate caches tested in this review can perform fast data transfers between the cores. Intel’s Core Duo (Yonah) and Conroe, each with a shared L2 cache, are the only processors that ensure fast processing of the same data block by two cores, yet their speed is limited too when the common data are modified. It means that the resources of dual-core processors are employed in the most efficient way when the execution threads are working with different memory sections or with the same memory section but without modifying the common data. For higher performance, the developer may want to strictly assign the threads to the cores because the OS may change the assignment of the threads when switching between the tasks which results in a higher percentage of cache misses.
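Assigning a thread to a core can look like the sketch below. It is an illustration under stated assumptions rather than the test utility's code: `pin_current_thread` is a hypothetical helper, and it relies on `os.sched_setaffinity`, which Python provides only on some platforms such as Linux. On Windows the usual counterpart is the `SetThreadAffinityMask` API.

```python
import os


def pin_current_thread(core):
    """Pin the calling thread to one logical CPU; return True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False                    # affinity control not exposed here
    os.sched_setaffinity(0, {core})     # 0 means "the calling thread"
    return True


def allowed_cores():
    """Logical CPUs the calling thread may run on, or None if unknown."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return None
```

A worker thread would call `pin_current_thread` once at start-up, before touching its data, so that the scheduler cannot move it to a core whose cache does not hold that data.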