Wednesday, June 25, 2014

CPU Cache Essentials

This post came to my mind after watching the excellent presentation of Scott Meyers called "CPU Caches and Why You care". This post will try to summarize the ideas of the presentation so if you have some spare time you can just watch the presentation on video.

To emphasize importance of cpu caches in our daily work we start with 2 examples:

The first problem is a simple traversing of 2 dimensional array. The way we can do this in c-like language is by traversing the array row by row. Alternatively we can traverse it column by column.

uint64_t matrix[ROWS][COLUMNS];
uint64_t i,j;

uint64_t ret;

/* row by row */
for (i = 0; i < ROWS; i++)
  for (j = 0; j < COLUMNS; j++)
    ret += matrix[i][j];

/* column by column */
for (j = 0; j < COLUMNS; j++)
  for (i = 0; i < ROWS; i++)
    ret += matrix[i][j];

Strangely for large arrays (> 10MB) traversing column by column leads to terrible performance.

The second problem is a parallel processing of some data in large array. We divide the array into X chunks and process each chunk in a separate thread. For example we want to count number of bytes which are set to 1. The following implementation doesn't scale when we run it on machines with more and more cores.

char array[SIZE_10_MB];
int X = NUM_OF_CORES;  
int results[X];

void chunk_worker(int index)
  int i;
  int work_size = SIZE_10_MB/X;
  for (i = work_size * index; i < work_size * (index + 1); i++) {
    if (array[i] == 1) {
      results[index] += 1;

This weird behavior can be explained after we learn about the cpu caches.

CPU caches are a small amount of unusually fast memory. We have 3 types of caches in a regular CPU:
  • D-cache  - cache used to store data
  • I-cache - cache used to store code (instructions)
  • TLB - cache used to store virtual to real memory address translations
These caches are arranged in a typical 3 layer hierarchy:

   typical i7-9xx (4 cores) example
   |     |           | share by |  shared by |            |
   |     | I/D cache |    cores | hw threads | latency    |
   | L1  | 32KB/32KB |        1 |          2 | 4 cycles   |
   | L2  | 256KB     |        1 |          2 | 11 cycles  |
   | L3  | 8MB       |        4 |          8 | 39 cycles  |
   | RAM |           |          |            | 107 cycles |

For example L2 is 256KB chunk of fast memory (11 cycles access time) which is used to cache both data and instructions and which is shared by 2 hardware threads on single core.

By the way there is one type of memory that we didn't mention which can beat the performance of all these layers - the cpu registers.

Now when we talk about making your programs fast and furious, the only thing that is really matters is how well you can fit into the cache hierarchy. It won't even matter if you are using machine with 64G. In the hardware world smaller is faster. Compact code and data structures will always be fastest.

Since access to main memory is so expensive, the hardware will bring a whole chunk of memory to put it into cache line. Typical size of cache line is 64 bytes so each time we read one byte of memory 64 bytes of data will enter our cache (and probably evict some other 64 bytes). Writing one byte of memory will eventually lead to writing 64 bytes of data to memory.

One interesting thing about these cache lines is the fact that our hardware is pretty smart to perform prefetch of cache line once it detects forward/backward traversal.

Thinking back about first problem of traversing matrix. It is now clear why the column by column case is having such a bad performance. When we traversing column by column we are not using each cache line effectively. In fact be bring complete cache line just to access one byte and later when the cache line is evicted from the small cache we our accessing the second byte.

Reasoning about the coherency of the different caches becomes impossible task. Luckily we don't have to reason too much, the hardware will take care of synchronization as long as we will use proper synchronization primitives (high level mutexes, read/write barriers and etc). Unfortunately this simplification comes with cost - TIME. Your hardware will spend precious time on synchronization which will reduce the performance of your program.

Another effect of CPU caches is called "False Sharing".  Suppose core 0 reads address A and core 1 writes to address A+1. Since A and A+1 occupy same cache line, hardware will need to synchronize caches by constantly invalidating the cache line and catching it back. This is exactly what happens in problem 2 where:

results[index] += 1;

invalidates the cache line each increment.

Quick fix of using local variable to maintain result of each thread and setting them at the end leads to performance boost.

int array[SIZE_10_MB];
int X = NUM_OF_CORES;  
int results[X];

void chunk_worker(int index)
  int i;
  int sum = 0;
  int work_size = SIZE_10_MB/X;
  for (i = work_size * index; i < work_size * (index + 1); i++) {
    if (array[i] == 1) {
      sum += 1;
  results[index] = sum;

To conclude here are some tips you can use to boost performance by being aware of the CPU cache tradeoffs:

Data cache tips:

  • Use linear array traversal. Hardware will often optimize and pref-etch the data so that the speedup will be substantial
    • Use as much of cache line as possible. For example in the next code when else clause is happening, we are throwing complete cache line which was fetched by accessing the is_alive member. Solution to this could be to make sure that most objects are alive.

struct Obj {
  bool is_alive;

std::vector<Object> objs;

for (auto o: objs) {
  if (o.is_alive)
  else {
    // just thrown a cache line
  • Be alert for false sharing in multi-core systems

Code cache tips:

  • Avoid iteration over heterogeneous sequence of objects with virtual calls. If we have sequence of heterogeneous objects the best thing would be to sort them by type so that executing virtual function of one object will lead to fetching code which can be used by the next object.
  • Make fast paths using branch-free sequences of code
  • Inline cautiously. 
    • Pros: reduce branches which will lead to speedup, compiler optimizations now possible
    • Cons: code duplication reduces code cache use
  • Use Profile-guided Optimizations (PGO) and Whole Program Optimizations (WPO) tools -these are automatic tools which will help you to optimize your code

Wednesday, February 12, 2014

Building Distributed Cache with GlusterFS - not a good idea

Cache is a popular concept. You can't survive in the cruel word of programming without using cache. Cache is inside you cpu. There is one inside you browser. Your favorite web site is serving pages from cache and you probably have one in your brain as well :)

Anyway, my mission was to build a distributed cache which can be used to speedup application X. This cache needed to be scalable and to allow extending it by adding more compute nodes. The compute nodes should be cheap amazon instances and the cache should provide very low latency (time until first byte of data arrives should be around 5ms).

Distributed file systems are a very cool peace of technology, so naturally it seems as a good idea to use it in building this cache. After reading about Ceph vs Gluster wars I've decided to settle down on Gluster. I liked very much the idea of translators which add functionality layer by layer. Each translator is a simple unit which handles one task of complicated file system.

The DHT translator in Gluster allows to distribute files between the different nodes. Later when the client wants to read file, it can locate the correct node using O(1) hash function computation and to read it directly from the node where it is located.

Using amazon, I quickly created 5 m1.large instances and installed glusterfs. The installation is pretty easy. I used my favorite tool for repeating tasks on multiple compute nodes (fabric) and after a while installation of x nodes became as simple as running

>> fab -H hostname1,hostname2,hostnamexx -P -f install_gluster create_volume create_mount

Next I created distributed glusterfs volume consisting of 5 nodes

>> gluster volume create cache hostname1:/brick1/cache hostname2:/brick1/cache hostname3:/brick1/cache hostname4:/brick1/cache hostname5:/brick1/cache

I modified application X code to try and use the remote cache and my algorithm became something similar to:

fd = open("/local_cache/file1/block1.bin", O_RDONLY);
if (fd == -1) {
    fd = open("/gluster/file1/block1.bin", O_RDONLY);   
pread(fd, ....)

And it worked well when there was small amount of application X instances (around 10). However when the amount of application X instances was increased (8 servers, 15 application X instances on each, total of 120 applications which continuously access glusterfs through fuse interface), I've noticed that access to mounted glusterfs file system became very slow.

Looking at latency of some of the glusterfs operation on client side (thanks good gluster has translator, called debug/io-stats, which can aggregate statistics), I've noticed that some of the lookup operation would take 10-20 seconds ! Basically lookup should be very lite operation, using the DHT translator the client should determine the exact server where the file should be located and then it can query it directly which will lead to distribution of stress between the 5 nodes. Each time I open the file a lookup command will be sent by gluster to determine where the file is located and later I can read it or write to it without a problem.

Fop     Call Count    Avg-Latency  Min-Latency    Max-Latency
---     ----------    -----------  -----------    -----------
STAT            58     2133.43 us      1.00 us    62789.00 us
MKDIR          112  1009804.77 us     18.00 us 14452294.00 us
OPEN           149    22108.78 us      2.00 us  1778858.00 us
READ           151   160008.16 us      7.00 us  7246801.00 us
FLUSH          149    13178.94 us      1.00 us  1594615.00 us
SETXATTR      2633   232459.26 us    719.00 us 10749344.00 us
LOOKUP        4725   641604.43 us      2.00 us 17630043.00 us
FORGET           1           0 us         0 us           0 us
RELEASE        149           0 us         0 us           0 us
------ ----- ----- ----- ----- ----- --- -----  ----- ----- ---

So why are lookups take so much time ? I realized I need to look at the network level. Luckily glusterfs has plugin for wireshark which can dissect traffic and show glusterfs operations.

Looking at the captured packets I saw that when we access file /file1/block1.bin, gluster will invoke 3 lookup operations. First for directory /, second for sub-directory file1 and then to file block1.bin.

Now here are the bad news: when performing lookup for directory, gluster will broadcast lookup request to all nodes and then wait until all nodes will respond. In my case one of the nodes (not the one where the actual file sits) would be super busy and would respond only after 5 seconds. And so the lookup request will wait for 5 seconds without trying to continue and actually communicate with the node where the file is located. This bad situation repeats for each path directory component and leads to awful performance with lookups.

Tweaking glusterfs options such as setting dht.lookup-unhashed=off, didn't help since this only relevant the file component and we still need to traverse at least two directory components until we reach the file component in the path.

So in my opinion such behavior makes gluster very unsuitable for functioning as cache where checking if data is in cache (lookup) is a frequent method. I guess hacking gluster and modifying the DHT translator can solve the issue (and introduce many other interesting issues :)

But this would be another interesting post ...