Re: [ROOT] Complex Matrix, Complex Vector

From: Thorsten Glebe (T.Glebe@mpi-hd.mpg.de)
Date: Mon May 21 2001 - 00:29:12 MEST


Hello,

to make sure that we are talking about the same things, let me explain some
points in more detail. The examples shown below might not be 100% syntactically
correct, but I hope you get the point. Sorry if you know all this already...

The big advantage of C++ over languages like FORTRAN/C is the possibility
of having abstraction in the code. We can design types (=classes) which
allow us to make code more expressive. A typical way to write a Matrix
class would be like this:

class Matrix {
  public:
  Matrix(unsigned int row, unsigned int col) :
        array(new double[row*col]),
        row_(row), col_(col) {}
  ~Matrix() { delete[] array; }
  // (a complete class would also need a copy constructor and an
  // assignment operator to manage the array)

  inline unsigned int row() const { return row_; }
  inline unsigned int col() const { return col_; }
  inline double& operator()(unsigned int i, unsigned int j) {
    return array[i*col_+j]; }
  inline double operator()(unsigned int i, unsigned int j) const {
    return array[i*col_+j]; }
  ...

  private:
  double* array;
  unsigned int row_;
  unsigned int col_;
};

An operator could be implemented like this:

Matrix operator+(const Matrix& A, const Matrix& B) {
  assert(A.row() == B.row() && A.col() == B.col());

  Matrix C(A.row(), A.col());
  for(unsigned int i=0; i<A.row(); ++i)
    for(unsigned int j=0; j<A.col(); ++j)
      C(i,j) = A(i,j) + B(i,j);
  return C;
}

A C++ statement like

 Matrix A,B,C;
 ...
 C = A + B;

would result in low level code similar to what you would also write
in FORTRAN/C:

  for(unsigned int i=0; i<row_ ; ++i)
    for(unsigned int j=0; j<col_; ++j)
      C[i*col_+j] = A[i*col_+j] + B[i*col_+j];

Up to this point everything is almost perfect; you will not see a big
difference in performance between C++ and FORTRAN/C. (Note: with respect to
execution speed the class design shown above is still the worst thing I can
imagine, although I use inlining wherever possible.)

The performance problem starts with an expression like

  Matrix A,B,C,D;
  ...
  D = A + B + C;

The C++ compiler would generate code which does the following:

  Matrix temp;
  for(unsigned int i=...)
    for(unsigned int j=...)
      temp[...] = A[...] + B[...];

  for(unsigned int k=...)
    for(unsigned int l=...)
      D[...] = temp[...] + C[...];

By using FORTRAN/C, you would naturally write it like this:

  for(unsigned int i=...)
    for(unsigned int j=...)
      D[...] = A[...] + B[...] + C[...];

The difference is obvious: in the C++ case the compiler generates a
temporary object and performs the nested loop twice!

This is what is called the C++ abstraction penalty: to the compiler, code
consists of instructions and loops, which it can optimize (by instruction
reordering or loop unrolling), but it cannot merge the two loops, because
they could have side effects which a compiler cannot easily analyze. The
compiler cannot "understand" your class design.
This is what you have to pay for abstraction. In a low level language
(where you do not have abstraction) you immediately write the optimal code
by hand (which is very time consuming and clumsy, and this is why we like
abstraction).

No matter how we improve the above class design, we cannot overcome the
abstraction penalty, yet this is what costs us most in execution speed.

To understand how one can overcome the abstraction penalty, one has to know
a few things about templates:

In 1994, Erwin Unruh discovered that one can use templates to perform
computations at compile time. He demonstrated it with a piece of code which
made the C++ compiler compute prime numbers at compile time and print them
out in its error messages.

A simpler example than the computation of prime numbers is the compile-time
computation of a factorial. This is how you can do it:

template <unsigned int I>
struct fac {
  static const unsigned long int result = I * fac<I-1>::result;
};

template <>
struct fac<0> {
  static const unsigned long int result = 1;
};

int main() {
  // compute 20! at compile time:
  unsigned long int f = fac<20>::result;
  return 0;
}

At compile time, the compiler unrolls the recursive template like this:

  unsigned long int f = fac<20>::result
                 = 20 * fac<19>::result
                 = 20 * 19 * fac<18>::result
                 = ...
                 = 20 * 19 * 18 * ... * 1
                 = <whatever the result is>

In the object file you will find a single instruction with the result
of the computation.

The above example is not a cute trick; it is a feature of C++ templates which
has to do with the way a C++ compiler has to generate code from templates.
In fact it is possible to write a Turing machine using templates, or, in
other words, C++ templates are a programming language in their own right.
You can formulate if..then, for... and switch/case statements using
templates. The consequence is that you can produce arbitrarily complex code
out of templates. This is what is called "template metaprogramming" (as you
are writing meta code, or, in other words, you do not write code any more,
but code generation rules).

The bottom line is that C++ is in fact a two level language and a C++ compiler
consists of an interpreter and a compiler part: The interpreter part interprets
templates to produce code from them, and in a second step the compiler part
compiles all that (whether a C++ compiler really works this way is another
question).

Now back to the abstraction penalty problem: You can look at the expression

 D = A + B + C

in a somewhat different way:

 D = BinaryOp<plus, BinaryOp<plus, A, B>, C>

in other words, one can formulate (using templates) the binary (and unary)
operators of a Matrix class such that they build a parse tree.
The parse tree can be evaluated in a way that completely avoids temporary
objects, and the generated low-level code is of the quality of
hand-optimized FORTRAN/C code. These "expression templates" are the key to
high performance numerical computations in C++, and libraries like Blitz++
are built around an expression template engine.

Now we have what we need: the user can write conveniently a Matrix/Vector
expression using the abstraction of a C++ class:

  D = A + B * C + -F;

and the compiler will translate this expression (which is then in fact a
templated parse tree) into low level code of high performance (in any case
much faster than code from a class suffering from the abstraction penalty).

Another big improvement which can be made with templates is to skip the
(very time consuming) dynamic memory allocation and replace it by
a stack allocation at compile time:

template <unsigned int row, unsigned int col>
class Matrix {
  ...
  private:
  double array[row*col];
};

A third powerful application of templates is the specialization of algorithms,
which pays off at small Vector/Matrix dimensions, where loop overheads dominate
the execution time. Here you need heavily recursively defined templates.
As a consequence you get code which the compiler can optimize much better.

Everything is fine now, but there is also a price to pay: code which makes
heavy use of expression templates and/or recursive templates is very clumsy
and difficult to understand. The average user will surely not want to program
this on his own, but use a readily available library.

After all that I hope you understand the points I want to make:

- In compiled ROOT code we can use whatever library we want; we all agree on
  this.

- In the case of CINT it is no problem to fully integrate Matrix/Vector
  classes which do suffer from the abstraction penalty problem (classes which
  are not based on expression templates).

- For Vector/Matrix classes based on expression templates it is almost
  impossible to integrate them fully into CINT, as CINT does not have the
  "interpreter" part of a C++ compiler. This is the reason why CINT can only
  deal with template instantiations and not with template definitions. You
  would have to precompile any expression you can think of...

- I want to make you aware of the fact that among the many numerical C++
  libraries which exist, you find those which suffer from the abstraction
  penalty and those which do not. Therefore one should carefully choose the
  library one wants to support in ROOT.

In fact, in the last months I have made some speed comparisons between a
Matrix/Vector library suffering from the abstraction penalty and an
implementation based on expression templates (both libraries were written by
myself, but the results are consistent with numbers quoted in the
literature). I have also made a speed comparison between a Kalman filter
based vertex fit algorithm built on Matrix/Vector classes with the
abstraction penalty and an implementation based on expression templates.

The outcome is that the Kalman filter (which consists of a lot of matrix
expressions, as you might know) is a factor 20 faster using a Matrix class
based on expression templates compared to an implementation based on "naive"
C++ Matrix/Vector classes (I am talking here only about compiled code,
optimized with -O2 -funroll-loops!).

I also made a comparison between some Matrix/Vector expressions.

   Operation (10^6 times executed)     T1 [s]    T2 [s]    T1/T2
   --------------------------------------------------------------
   C = A * B                             8.09      1.29     6.27
   D = A + B * C                        12.44      1.43     8.70
   \vec{y} = A*\vec{x}                   4.40      0.28    15.71
   D = A + B + C                        11.29      0.43    26.25
   D = -A + B + C                       13.27      0.47    28.23
   a = \vec{x}^T * A * \vec{x}           2.56      0.07    36.57
   --------------------------------------------------------------
   Gain from -funroll-loops               +3%      +30%

As you might expect, T2 is the time of the Vector/Matrix classes based on
expression templates. The computer used was a Celeron 466 MHz, matrix
dimensions 3x3, vector dimension 3, gcc 2.95.2, flags used: -O2
-funroll-loops -finline-functions.
It is worth remarking that with loop unrolling you gain much more with
expression templates, because the expression templates have "pre-optimized"
the code such that the low-level optimization (done by the compiler) is more
effective.

This "result" might look different depending on which libraries you use or
which kind of expressions you compare. But I hope I have shown that libraries
based on expression templates are by design superior to libraries which do
not use expression templates.

N.B.: I don't have experience with Java, therefore I cannot say whether Java
does the job better...

Goodbye,
   Thorsten





This archive was generated by hypermail 2b29 : Tue Jan 01 2002 - 17:50:46 MET