Wednesday, June 25, 2014

CPU Cache Essentials

This post came to my mind after watching Scott Meyers' excellent presentation "CPU Caches and Why You Care". It tries to summarize the ideas of the talk, so if you have some spare time you can simply watch the presentation on video instead.

To emphasize the importance of CPU caches in our daily work, we start with two examples:

The first problem is a simple traversal of a two-dimensional array. In a C-like language we can do this by traversing the array row by row; alternatively, we can traverse it column by column.


uint64_t matrix[ROWS][COLUMNS];
uint64_t i,j;

uint64_t ret = 0;

/* row by row */
for (i = 0; i < ROWS; i++)
  for (j = 0; j < COLUMNS; j++)
    ret += matrix[i][j];

/* column by column */
for (j = 0; j < COLUMNS; j++)
  for (i = 0; i < ROWS; i++)
    ret += matrix[i][j];

Strangely, for large arrays (> 10 MB) traversing column by column leads to terrible performance.

The second problem is parallel processing of data in a large array. We divide the array into X chunks and process each chunk in a separate thread. For example, we want to count the number of bytes that are set to 1. The following implementation doesn't scale when we run it on machines with more and more cores.


char array[SIZE_10_MB];
int X = NUM_OF_CORES;  
int results[X];

void chunk_worker(int index)
{
  int i;
  int work_size = SIZE_10_MB/X;
  for (i = work_size * index; i < work_size * (index + 1); i++) {
    if (array[i] == 1) {
      results[index] += 1;
    }
  }
}

This weird behavior can be explained once we learn about CPU caches.

CPU caches are small amounts of unusually fast memory. A regular CPU has three types of caches:
  • D-cache - caches data
  • I-cache - caches code (instructions)
  • TLB - caches virtual-to-physical address translations
These caches are arranged in a typical 3 layer hierarchy:

   typical i7-9xx (4 cores) example
   |     |           | shared by | shared by  |            |
   |     | I/D cache |     cores | hw threads | latency    |
   | L1  | 32KB/32KB |         1 |          2 | 4 cycles   |
   | L2  | 256KB     |         1 |          2 | 11 cycles  |
   | L3  | 8MB       |         4 |          8 | 39 cycles  |
   | RAM |           |           |            | 107 cycles |

For example, L2 is a 256KB chunk of fast memory (11-cycle access time) which caches both data and instructions and is shared by the 2 hardware threads of a single core.

By the way, there is one type of memory we didn't mention which beats the performance of all these layers: the CPU registers.

Now, when we talk about making your programs fast and furious, the only thing that really matters is how well you fit into the cache hierarchy. It won't help that your machine has 64 GB of RAM; in the hardware world, smaller is faster. Compact code and compact data structures will always be fastest.

Since access to main memory is so expensive, the hardware brings in a whole chunk of memory at a time and puts it into a cache line. The typical cache line size is 64 bytes, so each time we read one byte of memory, 64 bytes of data enter our cache (and probably evict some other 64 bytes). Writing one byte of memory will eventually lead to writing 64 bytes back to main memory.
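The line size matters when laying out data structures. A minimal sketch, assuming the common 64-byte line size (C++17's std::hardware_destructive_interference_size can report it more portably where available):

```cpp
#include <cstddef>
#include <cstdint>

// 64 bytes is an assumption; it is the common line size on x86 hardware.
constexpr std::size_t CACHE_LINE = 64;

// A counter forced onto its own cache line: writes to one LineCounter will
// not touch the line holding a neighboring LineCounter.
struct alignas(CACHE_LINE) LineCounter {
    uint64_t value;
};

// alignas rounds the struct size up, so the rest of the line is padding.
static_assert(sizeof(LineCounter) == CACHE_LINE, "one full line per counter");
static_assert(alignof(LineCounter) == CACHE_LINE, "line-aligned");
```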

One interesting thing about cache lines is that the hardware is smart enough to prefetch the next line once it detects a forward or backward traversal pattern.
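Compilers also expose manual prefetch hints. The sketch below uses the GCC/Clang builtin __builtin_prefetch; since the hardware usually prefetches linear scans on its own, and the lookahead distance of 8 elements is an arbitrary assumption, treat this as illustration only:

```cpp
#include <cstddef>
#include <cstdint>

// Sum an array while hinting that a line a few elements ahead will be
// needed soon. Compiler-specific: guarded so other compilers still build it.
inline uint64_t sum_with_prefetch(const uint64_t *data, std::size_t n) {
    uint64_t ret = 0;
    for (std::size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]);  // hint only; no effect on results
#endif
        ret += data[i];
    }
    return ret;
}
```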

Thinking back to the first problem of traversing a matrix, it is now clear why the column-by-column case performs so badly. When we traverse column by column we don't use each cache line effectively: we bring in a complete cache line just to access a single element, and by the time we need the next element of that line it has already been evicted from the small cache.

Reasoning about the coherency of the different caches is an impossible task. Luckily we don't have to reason too much: the hardware takes care of synchronization as long as we use proper synchronization primitives (high-level mutexes, read/write barriers, etc.). Unfortunately, this simplification comes with a cost: time. Your hardware will spend precious cycles on synchronization, which reduces the performance of your program.


Another effect of CPU caches is called "false sharing". Suppose core 0 reads address A and core 1 writes to address A+1. Since A and A+1 occupy the same cache line, the hardware needs to keep the caches coherent by constantly invalidating the cache line and fetching it back. This is exactly what happens in problem 2, where

results[index] += 1;

invalidates the cache line on each increment.

A quick fix of using a local variable to accumulate each thread's result, and writing it out once at the end, leads to a big performance boost.


char array[SIZE_10_MB];
int X = NUM_OF_CORES;  
int results[X];

void chunk_worker(int index)
{
  int i;
  int sum = 0;
  int work_size = SIZE_10_MB/X;
  for (i = work_size * index; i < work_size * (index + 1); i++) {
    if (array[i] == 1) {
      sum += 1;
    }
  }
  results[index] = sum;
}
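An alternative sketch of the fix is to pad each result slot out to its own cache line, so the slots never share a line in the first place. The 64-byte line size and the names below (WORKERS, SIZE_EX) are assumptions for illustration:

```cpp
#include <cstddef>

constexpr std::size_t SIZE_EX = 1024;  // illustrative stand-in for SIZE_10_MB
constexpr int WORKERS = 4;             // illustrative stand-in for NUM_OF_CORES

char data[SIZE_EX];

// Each worker's counter sits on its own (assumed) 64-byte line, so writes by
// one core no longer invalidate the line another core is using.
struct alignas(64) PaddedCount {
    int value = 0;
};
PaddedCount padded_results[WORKERS];

void chunk_worker_padded(int index) {
    std::size_t work = SIZE_EX / WORKERS;
    for (std::size_t i = work * index; i < work * (index + 1); i++)
        if (data[i] == 1)
            padded_results[index].value += 1;
}
```

The local-accumulator version in the post avoids the problem with no wasted memory; padding is mainly useful when the per-thread slot must stay shared and visible.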

To conclude, here are some tips you can use to boost performance by being aware of the CPU cache tradeoffs:

Data cache tips:

  • Use linear array traversal. The hardware will often detect the pattern and prefetch the data, so the speedup can be substantial
  • Use as much of each cache line as possible. For example, in the code below, whenever the else clause is taken we throw away a complete cache line which was fetched just to read the is_alive member. One solution could be to make sure that most objects in the container are alive.

struct Obj {
  bool is_alive;
  ...
};

std::vector<Obj> objs;

for (auto& o : objs) {
  if (o.is_alive)
    do_stuff(o);
  else {
    // just threw away a cache line
  }
}

  • Be alert for false sharing in multi-core systems
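For the is_alive example above, one way to keep every fetched line useful is to partition the container so the live objects are contiguous and then scan only the live range. A minimal sketch, assuming std::partition's reordering is acceptable for the data:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Obj {
    bool is_alive;
    int payload;  // hypothetical; stands in for the object's real members
};

// Move live objects to the front; returns how many are live. A scan over
// [0, live) then touches only cache lines full of objects it will use.
inline std::size_t compact_live(std::vector<Obj> &objs) {
    auto mid = std::partition(objs.begin(), objs.end(),
                              [](const Obj &o) { return o.is_alive; });
    return static_cast<std::size_t>(mid - objs.begin());
}
```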

Code cache tips:

  • Avoid iterating over a heterogeneous sequence of objects with virtual calls. If we do have such a sequence, the best thing would be to sort the objects by type, so that executing the virtual function of one object fetches code that can be reused by the next object.
  • Make fast paths using branch-free sequences of code
  • Inline cautiously. 
    • Pros: reduce branches which will lead to speedup, compiler optimizations now possible
    • Cons: code duplication reduces the effective capacity of the code cache
  • Use Profile-Guided Optimization (PGO) and Whole-Program Optimization (WPO) tools - these automatic tools will help you optimize your code
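The sort-by-type idea can be sketched as follows; Shape, Square, and Circle are hypothetical types, and the ordering produced by std::type_index is arbitrary but groups objects of the same dynamic type together:

```cpp
#include <algorithm>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};
struct Square : Shape { int area() const override { return 4; } };
struct Circle : Shape { int area() const override { return 3; } };

// Group objects by dynamic type: once one object's virtual function is
// fetched into the I-cache, the same-type objects that follow reuse it.
inline void sort_by_type(std::vector<std::unique_ptr<Shape>> &v) {
    std::sort(v.begin(), v.end(),
              [](const std::unique_ptr<Shape> &a, const std::unique_ptr<Shape> &b) {
                  return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
              });
}
```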
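And a minimal example of a branch-free fast path, reusing the byte-counting problem from earlier: the comparison result is used as an arithmetic 0/1 value, so the loop body contains no conditional jump for the branch predictor to miss.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free count of bytes equal to 1: (data[i] == 1) evaluates to 0 or 1,
// which we add directly instead of branching on it.
inline uint64_t count_ones_branchless(const char *data, std::size_t n) {
    uint64_t count = 0;
    for (std::size_t i = 0; i < n; i++)
        count += static_cast<uint64_t>(data[i] == 1);
    return count;
}
```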