Saturday, December 1, 2012

Using performance counters on linux

One extremely useful feature linux has to offer is the ability to profile your system (user space and kernel) using the perf utility.

In a nutshell this utility allows you to count hardware and software events in linux kernel. Additional bonus is that when you count these events you can record cpu instruction pointer at the time of the event. Instruction pointer records, later can be used to generate concise execution profile of the kernel/user space code.

Similarly to git, perf uses sub-utils to introduce various functionality:

The first step is to list available events:

root@dev-12:# perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]

There are tons of events which we can divide into the following main categories:
  • Hardware events
  • Hardware cache events
  • Software events
  • Tracepoint events
Tracepoints events are special places in kernel that were specified by developers as a good position to trace. Stopping there usually brings you to the location where some important kernel functions starts or completes.

For example block:block_rq_complete  trace event is passed when block i/o request completes.

Hardware events make use of special cpu hardware registers, which count cpu cpecific hardware events and trigger interrupt when certain threshold is passed. Software events do not require special hardware support and usually generated by kernel handlers which process special events such as page fault.

To watch statistics of events in the system you can use stat command:

root@dev-12:# perf stat -e syscalls:sys_enter_write,page-faults \
                                       echo hello world > 1.txt

 Performance counter stats for 'echo hello world':

                 1 syscalls:sys_enter_write                                    
               165 page-faults                                                 

       0.001013974 seconds time elapsed

filtering using -e flag is possible and it is possible to count events on existing process using -p flag

Another step in using perf is to realize that events can be recorded using record command:

root@dev-12:# perf record -f hdparm -t /dev/vda > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.038 MB (~1673 samples) ]

And later code execution profile can be created using report command (each time event occur the current instruction pointer is recorded as well):

root@dev-12:/home/local/a/hdparm-9.37# perf report
# Events: 1K cpu-clock
# Overhead  Command      Shared Object                                Symbol
# ........  .......  .................  ....................................
    25.80%   hdparm  [kernel.kallsyms]  [k] copy_user_generic_string
     8.21%   hdparm  [kernel.kallsyms]  [k] blk_flush_plug_list
     4.69%   hdparm  [kernel.kallsyms]  [k] get_page_from_freelist
     3.62%   hdparm  [kernel.kallsyms]  [k] add_to_page_cache_locked
     3.09%   hdparm  hdparm             [.] read_big_block
     2.77%   hdparm  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     2.51%   hdparm  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.87%   hdparm  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge
     1.76%   hdparm  [kernel.kallsyms]  [k] file_read_actor
     1.76%   hdparm  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     1.55%   hdparm  [kernel.kallsyms]  [k] alloc_pages_current

We can filter the kernel code and see where in our program we spend most of our time

root@dev-12:# perf report -d hdparm
# Events: 64  cpu-clock
# Overhead  Command          Symbol
# ........  .......  ..............
    90.62%   hdparm  read_big_block
     9.38%   hdparm  time_device

Also by default we use cpu-clock event as the point where we stop to look at our code execution. We can use different events to find out interesting things. For example what are the places in our code that cause page faults:

root@dev-12:# perf record -f -e page-faults -F 100000 hdparm -t /dev/vda > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB (~278 samples) ]
root@dev-12:# perf report
# Events: 87  page-faults
# Overhead  Command      Shared Object                    Symbol
# ........  .......  .................  ........................
    72.05%   hdparm  hdparm             [.] prepare_timing_buf
    10.96%   hdparm         [.] 0x16b0          
     5.34%   hdparm       [.] 0x86fc0         
     2.53%   hdparm  [kernel.kallsyms]  [k] copy_user_generic_string
     1.26%   hdparm       [.] __ctype_init
     1.26%   hdparm       [.] _IO_vfscanf
     1.26%   hdparm       [.] strchrnul
     1.26%   hdparm       [.] mmap64
     1.26%   hdparm  hdparm             [.] main
     1.26%   hdparm  hdparm             [.] get_dev_geometry
     1.26%   hdparm  [kernel.kallsyms]  [k] __strncpy_from_user
     0.28%   hdparm  [kernel.kallsyms]  [k] __clear_user

To conclude, perf utility is extremely powerful tool. It provides accurate statistics with very little overhead, doesn't require recompilation and can be used to completely understand the overheads of kernel modules and user space application.

What a nice addition to the arsenal of your favorite hacking tools :)