Cling Performance Study

Measurements

The performance and size (disk and memory) of CINT and cling (the LLVM/clang-based interpreter prototype) have been studied to get a first rough estimate of the areas that are critical for cling.

Disclaimer

These measurements were done before the implementation of a growing AST; we expect considerable (orders of magnitude) improvements with the current trunk of cling. This is to be confirmed at the beginning of 2011, once the growing AST is complete.

Procedure

I have used the files at http://root.cern.ch/svn/root/branches/dev/axel/cling-timing-size; the sources were all from 2010-04-07, built optimized with debug info on a Ubuntu 9.10 64-bit machine with a Core2 quad. To avoid including overhead from TClass, cling was compared with standalone CINT.

The memory usage has been studied with ps (for resident set and virtual memory) and with tcmalloc (for heap usage), see makescripts_tcm.sh. Runtime was studied with bash's time builtin. Each study used a number of scripts generated from the same skeleton, which were then either only loaded (.L) or loaded and run (.X).

Several scripts have been tested for the different analyses; e.g. the run-time performance was tested by running the skeleton script ten times.

Results

                                                   CINT     cling   cling/CINT
I    startup (no scripts)
       real time                                  0.12s     0.06s      0.5
       RSS                                        7.1MB     9.7MB      1.4
       VSZ                                         31MB      35MB      1.1
       heap                                       2.3MB     0.1MB      0.0
II   run: 10*.X skeleton_STL.C
       real time                                    99s        1s      0.0
       RSS                                        9.0MB    18.1MB      2.0
       VSZ                                         34MB      41MB      1.2
       heap                                       2.7MB     1.6MB      0.6
III  load: 10*.L skeleton_STL.C
       real time                                   0.4s      0.7s      1.8
       RSS                                        8.0MB    16.7MB      2.1
       VSZ                                         33MB      41MB      1.2
       heap                                       2.7MB     1.6MB      0.6
IV   load: 100*.L skeleton_STL.C
       real time                                   0.5s      4.6s      9.2
       RSS                                        8.6MB    46.1MB      5.4
       VSZ                                         34MB      70MB      2.1
       heap                                       2.7MB    15.3MB      5.7
V    load STL include without instantiations: 100*.L skeleton_only_include.C
       real time                                   0.3s      4.3s      7.0
       RSS                                        8.6MB    27.9MB      3.2
       VSZ                                         34MB      52MB      1.5
       heap                                       2.8MB    11.3MB      4.0

Library sizes, for optimized builds without debug symbols (no -g) and after strip. cling is linked with the static libraries from clang and LLVM; cling's and libCint's shared library dependencies are identical (libncurses, libstdc++, ...).

libCint.so       2.4MB
cling           14.6MB   (i.e. cling is 6 times larger than libCint.so)
libReflex.so     0.7MB
libCintex.so     0.2MB

Analysis

The obvious parts: cling is bigger on disk; cling executes code faster; the startup behavior of cling and CINT is practically identical.

The two major differences (and disadvantages of cling) are the following:

Load Time

Cling just-in-time compiles at the time of loading. Conceptually, .L s.C should run exactly the same code as dlopen("libs.so"): users expect global static initializers to be executed, e.g.

#include <cstdio>

struct MyClass {
  MyClass() { printf("Boo!\n"); }
};
static MyClass gM;

must call the constructor of MyClass when loading the script.

For that to happen, the MyClass constructor must be executable, i.e. it must be just-in-time compiled, just like the initializer of gM that causes the constructor to be run. Thus JIT'ing at load time is the correct behavior.

We can accelerate this by

  • not using JIT but bytecode
  • using precompiled headers (see below)
  • reducing the number of optimization passes

Because cling shows correct behavior and CINT does not, the two numbers are not directly comparable; instead, cling's number could be compared with ACLiC's, in which case cling "wins" in load time. See Cling vs. ACLiC for a discussion of these two.

Heap Usage

CINT keeps file positions for templates around, so re-including an STL header from a different script is a no-op. Cling, on the other hand, will do a full re-parse of the included headers. With 10 sources each including e.g. <vector>, CINT's memory usage is identical to that of a single include of <vector>, while cling's memory usage will increase tenfold.

The load times of 100*.L skeleton_STL.C versus 100*.L skeleton_only_include.C show that the majority of the time and of the heap is spent parsing the STL headers, not in template instantiation: adding or removing the instantiation has almost no effect on either.

We expect that the parse time can be reduced by using a precompiled header. This corresponds to CINT's STL dictionaries (vector.dll), except that the precompiled header can be used for all template instantiations.

Resident Memory Usage

The resident size of cling is higher than that of CINT. It seems to scale with the number of loaded files: loading skeleton_STL.C 10 times versus 100 times increases the resident size by a factor of three, a behavior that CINT does not show.

To investigate this issue we ran four tests in which we allocated some memory and either left it alone or touched it (via memset), running under tcmalloc to see how the resident size behaves as the heap grows:
               heap size   RSS size   VSZ size
just malloc      15MB        8.5MB     40.2MB
just malloc      50MB       10.9MB     80.5MB
memset           15MB       23.9MB     40.2MB
memset           50MB       62.1MB     80.5MB
From these tests we can see that (at least with tcmalloc) allocating memory incurs an additional cost in resident memory that is about the same (around 8MB for the 15MB case and around 15MB for the 50MB case) whether or not the allocated memory itself is touched. From this we can conclude that the additional increase in resident memory size in the cling case is simply due to the (metadata of) the memory allocator itself.

Cling vs. ACLiC

Cling is faster at building compiled code, but ACLiC can reuse it. We need to

  • Compare the build times
  • Compare ACLiC's load time with cling's build time
  • Find out whether cling's compiled text can be cached on disk (note: we can do that asynchronously!)