Profiling guide

1. Introduction

There are multiple ways to search for performance bottlenecks in your application. This guide briefly covers some methods to analyse different types of performance issues. It is not meant to be complete or as detailed as possible, but should give an overview and a quick introduction to different tools, so you can read further on specific topics.

2. gprof

The tool gprof can be used to determine the amount of time spent in each function of the program.

To generate this information, your program must be compiled and linked with the gcc option -pg. It could look like this:

gcc -pg -c myprogram.c
gcc -pg -o myprogram myprogram.o
If your program uses an autoconf configure script, you can do something like this:
CPPFLAGS="-pg" LDFLAGS="-pg" ./configure
Once the program is compiled correctly, you just start and use it normally, exercising in particular those parts that you want to optimize.

After the program exits, a file gmon.out will be written which contains the profiling information.

You can now run gprof with the program binary as its argument:

gprof <executable name>
gprof will print the profiling results to stdout, which can look like this:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  8.82      0.06     0.06   907497     0.00     0.00  FieldListView::Content::getText(int) const
  5.88      0.10     0.04  5349237     0.00     0.00  NWC::FSEntry::getBasename() const
  4.41      0.13     0.03  4463406     0.00     0.00  NWCEntrySelectionState::getNWCEntry() const
The output is a sorted list of functions with the time spent inside each function itself (self seconds) and the running total of that time for the function and all functions above it in the list (cumulative seconds). You can also see the number of times each function has been called.

gprof can also output a call graph by using the argument -q and it can output annotated source code by using the argument -A.

3. gcov

To see how often each individual source code line is executed, the tool gcov can be used to generate coverage data.

As with profiling via gprof, the program needs to be compiled with special settings, in this case using the gcc option --coverage:

gcc --coverage -c myprogram.c
gcc --coverage -o myprogram myprogram.o
or again, for configure-based programs:
CPPFLAGS="--coverage" LDFLAGS="--coverage" ./configure
Now run the program and exercise the performance-critical sections. After exiting, coverage data will be written for each source code file, ending in .gcda and .gcno.

Human-readable information can be generated from that data by calling

gcov <source code file>

for example:
gcov myprogram.c
The command gcov will generate a file named after the source code file, ending in .gcov.

This file contains each original line plus additional coverage information. The first column is the number of executions of each individual line, the second column is the line number, and the rest of the line is the original line. You can see which lines of code are never executed, and identify those lines which are executed often.

4. valgrind

valgrind allows you to run unmodified programs on a special CPU simulation to identify runtime errors (such as accessing invalid memory), but also to profile programs. Multiple valgrind tools exist for different purposes; I focus here on those for profiling. Programs analyzed with valgrind run much slower than native, typically 10 to 100 times, but the generated data is very accurate.

4.1. callgrind

The valgrind tool callgrind can be used to determine the number of instructions executed, and optionally the cache hits and misses, for each function.

To generate the data, execute valgrind with the following arguments:

valgrind --tool=callgrind <executable>
or to enable cache simulations as well:
valgrind --tool=callgrind --cache-sim=yes <executable>
The output will be a file callgrind.out.<pid>. To generate human-readable output, run the tool callgrind_annotate:
callgrind_annotate callgrind.out.<pid>
The output will be detailed information for each function about how many instructions are executed, and how often the cache is hit or missed:
Profile data file 'callgrind.out.10263' (creator: callgrind-3.8.1)
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 3145728 B, 64 B, 12-way associative
Timerange: Basic block 0 - 185719573
Trigger: Program termination
Profiled target:  src/worker (PID 10263, part 1)
Events recorded:  Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Events shown:     Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Event sort order: Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:     
User annotated:   
Auto-annotation:  off

         Ir          Dr          Dw    I1mr      D1mr    D1mw   ILmr    DLmr    DLmw 
929,365,663 243,792,453 130,424,363 435,433 3,158,168 233,042 31,941 211,400 122,625  PROGRAM TOTALS

         Ir         Dr         Dw   I1mr      D1mr   D1mw ILmr   DLmr   DLmw  file:function
127,536,933 40,890,707 10,123,465  5,309   168,812    140   88    197      .  ???:re_search_internal [/lib64/]
 81,487,358 24,569,171 18,151,774  2,097     9,847  1,122   66      1      1  ???:__gconv_transform_utf8_internal [/lib64/]
 73,196,978  8,049,720  7,887,472    448    19,227  1,231   16      .      .  ???:_wordcopy_fwd_dest_aligned [/lib64/]
It is also possible to annotate the source code to see the instruction counts for each source code line. Execute the tool callgrind_annotate with the additional argument --auto=yes:
callgrind_annotate --auto=yes callgrind.out.<pid>
The output will be annotated source code like this:
Auto-annotated source: src/aguix/
     Ir      Dr      Dw I1mr   D1mr D1mw ILmr DLmr DLmw 

line 21 ----------------------------------------
      .       .    .      .    .    .    .    .  
      .       .    .      .    .    .    .    .  #include "button.h"
      .       .    .      .    .    .    .    .  #include "awindow.h"
      .       .    .      .    .    .    .    .  #include "guielement.h"
      .       .    .      .    .    .    .    .  #include "drawablecont.hh"
      .       .    .      .    .    .    .    .  
      .       .    .      .    .    .    .    .  const char *Button::type="Button";
      .       .    .      .    .    .    .    .  
    497       0     213    6      0    0    2    .    .  Button::~Button()
 45,509  12,176   3,437   83    400    9   10   43    1  => /scratch0/t36/worker-3.6.2/src/aguix/ (71x)
      .       .    .      .    .    .    .    .  {
    355     142      71    0     36    0    0    1    .    if(text[0]!=NULL) _freesafe(text[0]);
  6,106   1,420     710    0     96    1    0   33    .  => ???:free (71x)
    355     142      71    .      .    .    .    .    .    if(text[1]!=NULL) _freesafe(text[1]);
  6,106   1,420     710    0     92    0    0   83    .  => ???:free (71x)
    426     213      71    .      .    .    .    .    .  }
 58,857  15,371   5,141   87    624   10   12  160    1  => src/aguix/'2 (71x)
  9,411   2,474   1,380    0     81    0    0   10    .  => ???:operator delete(void*) (71x)
      .       .    .      .    .    .    .    .  

4.2. cachegrind

There is an additional tool, cachegrind, which performs cache simulation similar to callgrind (callgrind is actually an extension of cachegrind). The usage is similar to callgrind:
valgrind --tool=cachegrind --branch-sim=yes <program> <args> ....
and the actual output is generated by calling cg_annotate:
cg_annotate --auto=yes cachegrind.out.<pid>

4.3. helgrind

The valgrind tool helgrind is not related to profiling, but I still want to mention it. It can be used to debug multi-threaded programs by detecting possible race conditions. It is called like this:
valgrind --tool=helgrind <program> <args> ....
Just like plain valgrind, it will output possible problems during execution; there is no separate analysis tool for helgrind.

5. perf

The Linux tool perf can be used to identify the functions taking the most time. It also runs on unmodified programs, and it can even be used to profile the whole system, as it can profile all running processes.

5.1. perf top

The perf top command gives a quick overview of the functions consuming the most runtime across all processes. Just execute
sudo perf top
It can be limited to a specific process by using the -p option to set the pid to watch.

5.2. perf record/report

A program can also be profiled separately using the record/report mode of perf. First, run your program in record mode:
perf record --call-graph dwarf <binary>
and after quitting, the interactive report mode can be started:
perf report
The curses interface will output something like this:
Samples: 10K of event 'cycles', Event count (approx.): 7454153282               
+  14,49%  worker            [.] re_search_internal        
+  10,62%  worker            [.] __bronc                 
+   9,51%  worker            [.] _wordcopy_fwd_dest_aligned
The functions can be selected with the cursor keys, and the return key shows each call site of the corresponding function. This helps to find where a function is actually called very often.

By pressing cursor right, you can even choose to annotate the assembler code to see which part actually uses the most runtime.

6. Further reading

You can find more information about the tools mentioned here on the corresponding homepages: