AMD Athlon 64 X2

The first processor tested in this review is the dual-core AMD Athlon 64 X2 with a rating of 4400+, a clock rate of 2200MHz and 1MB of cache memory per execution core. The two cores not only reside on the same silicon die but are also connected to the crossbar switch via a system request interface. All requests for data in system memory pass through the switch, so we can expect the cores to communicate with each other without the mediation of the system or memory bus.

We’ll start with the case in which the first thread reads data without modifying them. That is, identical copies of the data are stored in system memory and in the cache of the core that runs the first thread.


Pic.1: AMD Athlon 64 X2. Sequential reading of non-modified data loaded into the cache of the other core.


Pic.2: AMD Athlon 64 X2. Random reading of non-modified data loaded into the cache of the other core.

The diagrams show how the average data read latency depends on the length of the delay chain. The first graph (Picture 1) shows that the minimum latency for sequential access hardly depends on the size of the data block: it is about 50 cycles irrespective of where the data are expected to be. This means the data are always fetched from system memory. Those 50 cycles shouldn’t be regarded as memory latency, though. They are the average time between a data request and the data load, with hardware prefetch taken into account.
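To put the cycle counts into more familiar units, they can be converted to time using the 2200MHz clock rate given above. The helper below is only an illustrative sketch; the function name is hypothetical and not part of the test program used in this review.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a count of CPU clock cycles to nanoseconds."""
    # One cycle lasts 1000 / clock_mhz nanoseconds.
    return cycles * 1000.0 / clock_mhz


# About 50 cycles at the 2200MHz clock rate quoted for this CPU.
latency_ns = cycles_to_ns(50, 2200)
print(f"{latency_ns:.1f} ns")  # prints "22.7 ns"
```

So the 50-cycle plateau corresponds to a little under 23 nanoseconds per read.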

The diagrams look more interesting for random access (Picture 2). The minimum latency differs with the data block size and is lower for smaller blocks. However, a latency of over 80 cycles still looks too high for a cache access, even for an access to another core’s cache, and the latency for data blocks that fit entirely into one level of cache memory shouldn’t vary that much. This makes me suspect the influence of the prefetch mechanism again. I’ll check this later on.
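The difference between the two access patterns can be pictured as two ways of linking the elements of a data block into a chain of dependent reads. The sketch below is not the test program used in this review. All names in it are hypothetical, and using Sattolo’s algorithm to build the random chain is my assumption about one reasonable way to do it. Python only shows the structure of the chains here, since real latency measurements need compiled code.

```python
import random


def sequential_chain(n):
    """Each element points to the next one; the last wraps around to 0."""
    return [(i + 1) % n for i in range(n)]


def random_chain(n, seed=0):
    """Link all n elements into one random cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def walk(nxt, start=0):
    """Follow the chain until it returns to start; list the visited indices."""
    order = [start]
    i = nxt[start]
    while i != start:
        order.append(i)
        i = nxt[i]
    return order


print(walk(sequential_chain(8)))
print(walk(random_chain(8)))
```

Both chains visit every element exactly once per pass. The sequential one does so in address order, which a hardware prefetcher can follow easily, while the random one gives it no simple pattern to anticipate.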

You can also see a step-like increase in latency as the chain of delay instructions becomes longer. It is noticeable starting with the 16KB data block and is also characteristic of the 4MB block, where a cache hit is out of the question. The step is 10 cycles high. The frequency multiplier of this CPU is 10, too. Could the step-like shape of the graph be related to the rate at which data come in from external (from the CPU’s point of view) memory? I’ll check this out, too.
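The hypothesis can be illustrated with a toy model. Assume, purely for the sake of illustration, that data can only arrive on the ticks of an external clock and that one tick lasts as many CPU cycles as the multiplier quoted in the text. Any raw latency is then rounded up to the next multiple of that number. The function name and the sample latencies are hypothetical, and nothing here has been verified against the hardware.

```python
def round_up_to_step(raw_cycles, step_cycles):
    """Round a latency up to the next multiple of step_cycles."""
    # Ceiling division using integers only.
    return -(-raw_cycles // step_cycles) * step_cycles


MULTIPLIER = 10  # CPU cycles per external clock tick, as stated in the text

# Made-up raw latencies, just to show the rounding.
for raw in (81, 85, 90, 91):
    print(raw, "->", round_up_to_step(raw, MULTIPLIER))
```

Under this assumption the observed latency can only change in jumps of 10 cycles, which would produce the kind of steps seen in the graph.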

So, the results of reading unmodified data make me think that there is no fast reading directly from another core’s cache. This must be how the MOESI protocol is implemented here: the requesting core fetches the most recent copy of the data from system memory. Let’s see if we have more luck with modified data, the valid copy of which is stored in the first core’s cache.
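For reference, the five MOESI line states and the distinction that matters for the next test can be written down as a small sketch. It describes the protocol in general terms, not AMD’s implementation, and the names are hypothetical.

```python
from enum import Enum


class LineState(Enum):
    MODIFIED = "M"   # only cached copy, differs from system memory
    OWNED = "O"      # differs from memory, other caches may share it
    EXCLUSIVE = "E"  # only cached copy, identical to system memory
    SHARED = "S"     # other caches may hold copies too
    INVALID = "I"    # line holds no usable data


def memory_copy_is_stale(state):
    """True if the cache holding this line has data newer than system memory."""
    return state in (LineState.MODIFIED, LineState.OWNED)


for state in LineState:
    print(state.name, memory_copy_is_stale(state))
```

In the unmodified case tested above, system memory still holds a valid copy, so a request can be satisfied from memory. Once the first core has modified the data, the copy in memory is stale, and the up-to-date data have to come from that core’s cache one way or another.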
