Testbed and Methods
To measure the speed of access to another core’s cache, we developed an algorithm in which two execution threads work with a common data block. The test utility creates two threads, each assigned to its own logical processor. The first thread works with a data block in memory in one of two modes: read-only (the data block is simply transferred from system RAM into cache) and read-and-modify (the data are modified after the load). This lets us check two possible cases:
- The most recent copy of data is stored in system RAM and in the second processor’s cache
- The most recent copy of data is stored only in the second processor’s cache whereas system RAM stores a stale copy
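As a rough illustration, the two load modes could be sketched in C as follows. This is not the test utility's actual code: the names prepare_block, load_mode and CACHE_LINE are hypothetical, and the 64-byte line size is an assumption. In read-only mode each line is only loaded, so RAM and the cache hold the same data. In read-and-modify mode the store leaves the cached line newer than the copy in RAM.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

enum load_mode { READ_ONLY, READ_AND_MODIFY };

/* Touch one byte in every cache line of the block.
   Returns the sum of the bytes read, so the loads cannot be optimized away. */
static unsigned prepare_block(volatile unsigned char *block, size_t size,
                              enum load_mode mode)
{
    assert(block != NULL);
    unsigned sum = 0;

    for (size_t off = 0; off < size; off += CACHE_LINE) {
        unsigned char v = block[off];            /* load pulls the line into cache */
        sum += v;
        if (mode == READ_AND_MODIFY)
            block[off] = (unsigned char)(v + 1); /* store makes the cached copy the newest one */
    }
    return sum;
}
```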
Having finished with the data block, the first thread hands control to the second thread, which has been waiting for the signal, and then stops to wait in its turn. The second thread reads the same data block and measures the read speed.
The data may get cached by the second core while the second thread reads them, so the usual way of improving the accuracy of read speed measurements, running the test many times over the same memory area, does not work. Instead, the second thread evicts the loaded data from its cache by applying the CLFLUSH instruction to each cache line that holds them, and then hands control back to the first thread for another data load.
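The flushing step can be approximated in C with the SSE2 intrinsic _mm_clflush from <emmintrin.h>. The sketch below assumes an x86 processor and a 64-byte cache line. The helper name flush_block is hypothetical, and the trailing fence is a choice made for this example rather than something the text above describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

#define CACHE_LINE 64    /* assumed cache line size in bytes */

/* Flush every cache line that overlaps [block, block + size).
   Returns the number of lines flushed. */
static size_t flush_block(const void *block, size_t size)
{
    assert(block != NULL);
    if (size == 0)
        return 0;

    /* Round the start address down to a line boundary. */
    uintptr_t p   = (uintptr_t)block & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)block + size;
    size_t lines  = 0;

    for (; p < end; p += CACHE_LINE) {
        _mm_clflush((const void *)p);
        lines++;
    }
    _mm_mfence();        /* order the flushes before any later loads */
    return lines;
}
```

For an 8KB block aligned to a line boundary this issues 128 flushes, one per line.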
Throughout this review I will refer to the core that runs the first thread (which loads the data block from memory into cache) as the first core, and to the core that runs the speed-measuring thread as the second core.
The threads are synchronized by means of so-called spin-wait loops in both threads (a simple loop of the form while(!signal), in which the variable signal is modified from outside by the other thread). These loops hand control from one thread to the other as quickly as possible and also keep each thread on its assigned processor, preventing other threads from running on it and changing the cache contents. Both threads run at the highest priority (THREAD_PRIORITY_REALTIME).
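A minimal sketch of such a handoff, written in portable C11 with pthreads rather than with the Windows threading calls the utility presumably uses, might look like this. All names (handoff, spin_until, run_handoff and so on) are hypothetical. Pinning the threads to processors and raising their priority are left out, and the calling thread plays the role of the second thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { LOADER_TURN, READER_TURN };

struct handoff {
    atomic_int turn;   /* the "signal" variable both loops poll */
    int rounds;        /* number of load/measure rounds */
    int loads;         /* rounds completed by the first thread */
    int reads;         /* rounds completed by the second thread */
    int in_order;      /* reads that came right after the matching load */
};

/* Spin-wait: poll the shared variable without ever yielding the processor. */
static void spin_until(atomic_int *turn, int wanted)
{
    while (atomic_load_explicit(turn, memory_order_acquire) != wanted)
        ;
}

static void *loader_thread(void *arg)
{
    struct handoff *h = arg;
    for (int i = 0; i < h->rounds; i++) {
        spin_until(&h->turn, LOADER_TURN);
        h->loads++;    /* placeholder for loading or modifying the block */
        atomic_store_explicit(&h->turn, READER_TURN, memory_order_release);
    }
    return NULL;
}

static void reader_loop(struct handoff *h)
{
    for (int i = 0; i < h->rounds; i++) {
        spin_until(&h->turn, READER_TURN);
        if (h->loads == i + 1)
            h->in_order++;
        h->reads++;    /* placeholder for the timed read and the cache flush */
        atomic_store_explicit(&h->turn, LOADER_TURN, memory_order_release);
    }
}

/* Run the ping-pong for the given number of rounds. Returns 0 on success. */
static int run_handoff(struct handoff *h, int rounds)
{
    assert(rounds >= 0);
    pthread_t loader;

    h->rounds = rounds;
    h->loads = h->reads = h->in_order = 0;
    atomic_init(&h->turn, LOADER_TURN);
    if (pthread_create(&loader, NULL, loader_thread, h) != 0)
        return -1;
    reader_loop(h);
    pthread_join(loader, NULL);
    return 0;
}
```

The release store paired with the acquire load makes the first thread's work visible to the second thread before it starts reading, which is why in_order ends up equal to the number of rounds.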
To measure the data read latency in the second thread, we use a chain of linked pointers interspersed with commands to delay the use of the read data:
// eax is the beginning of the data block
xor ebx, ebx // ebx = 0
xor edx, edx // edx = 0
………
and edx, eax // synchronization
mov eax, [eax+ebx] // a data read
N*{and ebx, edx} // delay to use the read data
and edx, eax // syncing the moment of use of the read data
mov eax, [eax+ebx] // using the read data; the next data read
N*{and ebx, edx}
and edx, eax
mov eax, [eax+ebx]
………
Here, N is the number of delay commands. Varying the delay lets us account for such effects of speculative execution as replays and see how the data transfer speed depends on the length of the chain. This gives us a better understanding of the characteristics of the memory subsystem. The sync command fixes the order of the delay commands and so prevents the processor from reordering the read commands in the chain.
We run the test with data blocks of 8KB to 4094MB, stepping through powers of two. This covers a wide range of cases, from data blocks that fit entirely into L1 cache to blocks that do not fit into the cache of any of the tested processors. Within a block, the data are processed with a step of 64 bytes.
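To make the chain of linked pointers concrete, here is a hedged C sketch of how a block could be linked with a 64-byte step and then walked. It assumes the lines are linked in sequential order and that the last line points back to the first; the description above confirms neither. The names build_chain, walk_chain and STRIDE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STRIDE 64   /* step between chain elements, in bytes */

/* Store in the first word of every 64-byte line the address of the next
   line; the last line points back to the first. Returns the link count. */
static size_t build_chain(void *block, size_t size)
{
    assert(block != NULL && size >= STRIDE && size % STRIDE == 0);
    char *base   = block;
    size_t lines = size / STRIDE;

    for (size_t i = 0; i < lines; i++) {
        void **slot = (void **)(base + i * STRIDE);
        *slot = base + ((i + 1) % lines) * STRIDE;
    }
    return lines;
}

/* Follow `hops` links. Each load needs the result of the previous one,
   like the mov eax, [eax+ebx] chain in the listing. */
static void *walk_chain(void *start, size_t hops)
{
    void *p = start;
    while (hops--)
        p = *(void *volatile *)p;   /* volatile keeps every load in place */
    return p;
}
```

A sweep over block sizes would then double the size on each pass, calling build_chain on the block and timing walk_chain over it.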



