Hello,

to make sure that we are talking about the same things, let me explain some points in more detail. The examples shown below might not be 100% syntactically correct, but I hope you get the point. Sorry if you know all this already...

The big advantage of C++ over languages like FORTRAN/C is the possibility of having abstraction in the code: we can design types (= classes) which allow us to make code more expressive. A typical way to write a Matrix class would be like this:

  class Matrix {
  public:
    Matrix(unsigned int row, unsigned int col)
      : array(new double[row*col]), row_(row), col_(col) {}
    ~Matrix() { delete[] array; }
    unsigned int row() const { return row_; }
    unsigned int col() const { return col_; }
    double& operator()(unsigned int i, unsigned int j)
      { return array[i*col_+j]; }
    double operator()(unsigned int i, unsigned int j) const
      { return array[i*col_+j]; }
    ...
  private:
    double* array;
    unsigned int row_;
    unsigned int col_;
  };

(Note that operator() needs a non-const overload returning double&, so that matrix elements can be assigned to.) An operator could be implemented like this:

  Matrix operator+(const Matrix& A, const Matrix& B)
  {
    assert(A.row() == B.row() && A.col() == B.col());
    Matrix C(A.row(), A.col());
    for(unsigned int i=0; i<A.row(); ++i)
      for(unsigned int j=0; j<A.col(); ++j)
        C(i,j) = A(i,j) + B(i,j);
    return C;
  }

A C++ statement like

  Matrix A,B,C;
  ...
  C = A + B;

results in low-level code which is almost identical to what you would also write in FORTRAN/C:

  for(unsigned int i=0; i<row_; ++i)
    for(unsigned int j=0; j<col_; ++j)
      C[i*col_+j] = A[i*col_+j] + B[i*col_+j];

Up to this point everything is almost perfect; you will not see a big difference in performance between C++ and FORTRAN/C. (Note: with respect to execution speed the class design shown above is still the worst thing I can imagine, even though everything is defined inline.) The performance problem starts with an expression like

  Matrix A,B,C,D;
  ...
  D = A + B + C;

The C++ compiler generates code which does the following:

  Matrix temp;
  for(unsigned int i=...)
    for(unsigned int j=...)
      temp[...] = A[...] + B[...];
  for(unsigned int k=...)
    for(unsigned int l=...)
      D[...] = temp[...] + C[...];

In FORTRAN/C, you would naturally write it like this:

  for(unsigned int i=...)
    for(unsigned int j=...)
      D[...] = A[...] + B[...] + C[...];

The difference is obvious: in the C++ case the compiler generates a temporary object and performs the nested loop twice! This is what is called the C++ abstraction penalty. The compiler cannot generate optimal code from the class, because to a compiler the code consists of instructions and loops which can be optimized (by instruction reordering or loop unrolling), but it cannot merge loops, because they may have side effects which a compiler cannot easily analyze. The compiler cannot "understand" your class design. This is what you have to pay for abstraction. In a low-level language (where you do not have abstraction) you write the optimal code by hand right away (which is very time consuming and clumsy, and this is why we like abstraction). No matter how we improve the above class design, we cannot overcome the abstraction penalty, and it is what costs us most with respect to execution speed.

To understand how one can overcome the abstraction penalty, one has to know a few things about templates: In 1994, a person named Erwin Unruh discovered that one can use templates to perform computations at compile time. He demonstrated it by writing a piece of code which caused the C++ compiler to compute prime numbers at compile time. A simpler example than the computation of prime numbers is the compile-time computation of a factorial. This is how you can do it:

  template <unsigned int I>
  struct fac {
    static const unsigned long int result = I * fac<I-1>::result;
  };

  template <>
  struct fac<0> {
    static const unsigned long int result = 1;
  };

  int main() {
    // compute 20! at compile time:
    unsigned long int f = fac<20>::result;
  }

At compile time, the compiler unrolls the recursive template like this:

  f = fac<20>::result
    = 20 * fac<19>::result
    = 20 * 19 * fac<18>::result
    = ...
    = 20 * 19 * 18 * ...
    * 1 = <whatever the result is>

In the object file you will find a single instruction with the result of the computation. The above example is not a cute trick; it is a feature of C++ templates which has to do with the way a C++ compiler has to generate code from templates. In fact it is possible to write a Turing machine using templates, or, in other words, C++ templates are a programming language in their own right. You can formulate if..then, for..., switch/case statements by using templates. The consequence is that you can produce arbitrarily complex code out of templates. This is what is called "template metaprogramming" (you are writing meta code, or, in other words, you do not write code any more, but code generation rules). The bottom line is that C++ is in fact a two-level language, and a C++ compiler consists of an interpreter part and a compiler part: the interpreter part interprets templates to produce code from them, and in a second step the compiler part compiles all that (whether a C++ compiler really works this way internally is another question).

Now back to the abstraction penalty problem. You can look at the expression D = A + B + C in a somewhat different way:

  D = BinaryOp<plus, A, BinaryOp<plus, B, C> >

In other words, one can formulate (using templates) the binary (and unary) operators of a Matrix class such that they build a parse tree of the expression. The parse tree can then be evaluated in such a way that temporary objects are avoided completely, and the generated low-level code is of the quality of hand-optimized FORTRAN/C code. These "expression templates" are the key to high-performance numerical computations in C++, and libraries like Blitz++ are based on an expression template engine.
Now we have what we need: the user can conveniently write a Matrix/Vector expression using the abstraction of a C++ class:

  D = A + B * C + -F;

and the compiler will translate this expression (which is in fact a templated parse tree) into low-level code of high performance (in any case much faster than a class suffering from the abstraction penalty).

Another big improvement which can be made with templates is to skip the (very time consuming) dynamic memory allocation and replace it by stack allocation, with the dimensions fixed at compile time:

  template <unsigned int row, unsigned int col>
  class Matrix {
    ...
  private:
    double array[row*col];
  };

A third powerful application of templates is the specialization of algorithms, which pays off at small Vector/Matrix dimensions, where loop overheads dominate the execution time. Here you need heavily recursively defined templates. As a consequence you get code which the compiler can optimize much better.

Everything is fine now, but there is also a price to pay: code which makes heavy use of expression templates and/or recursive templates is very clumsy and difficult to understand. The average user will surely not want to program this on his own, but rather use a readily available library.

After all that, I hope you understand the points I want to make:

- In compiled ROOT code we can use whatever library we want; we all agree on this.

- In the case of CINT it is no problem to fully integrate Matrix/Vector classes which do suffer from the abstraction penalty (classes which are not based on expression templates).

- For Vector/Matrix classes based on expression templates it is almost impossible to integrate them fully into CINT, as CINT does not have the "interpreter" part of a C++ compiler. This is the reason why CINT can only deal with template instantiations and not with template definitions. You would have to precompile any expression you can think of...
- I want to make you aware of the fact that among the many numerical C++ libraries which exist, you find those which suffer from the abstraction penalty and those which do not. Therefore one should carefully choose the library one wants to support in ROOT.

In fact, in the last months I have made some speed comparisons between a Matrix/Vector library suffering from the abstraction penalty and an implementation based on expression templates (both libraries were written by myself, but the results are consistent with numbers quoted in the literature). I have also made a speed comparison between a Kalman-filter-based vertex fit algorithm using Matrix/Vector classes with abstraction penalty and an implementation based on expression templates. The outcome is that the Kalman filter (which consists of a lot of matrix expressions, as you might know) is a factor of 20 faster when using a Matrix class based on expression templates, compared to an implementation based on "naive" C++ Matrix/Vector classes (I am talking here only about compiled code, optimized with -O2 -funroll-loops!).

I also made a comparison between some individual Matrix/Vector expressions:

  Operation (10^6 times executed)       T1 [s]   T2 [s]   T1/T2
  --------------------------------------------------------------
  C = A * B                              8.09     1.29     6.27
  D = A + B * C                         12.44     1.43     8.70
  \vec{y} = A * \vec{x}                  4.40     0.28    15.71
  D = A + B + C                         11.29     0.43    26.25
  D = -A + B + C                        13.27     0.47    28.23
  a = \vec{x}^T * A * \vec{x}            2.56     0.07    36.57
  --------------------------------------------------------------
  gain from -funroll-loops               +3%      +30%

As you might expect, T2 is the time of the Vector/Matrix classes based on expression templates. The computer used was a Celeron 466 MHz, matrix dimensions 3x3, vector dimension 3, gcc 2.95.2, flags used: -O2 -funroll-loops -finline-functions.
It is worth remarking that with loop unrolling you gain much more when using expression templates, because the expression templates have "pre-optimized" the code in such a way that the low-level optimization (done by the compiler) is more effective.

This "result" might look different depending on which libraries you use or which kind of expressions you compare. But I hope I have made clear that libraries based on expression templates are by design superior to those which do not use expression templates.

N.B.: I don't have experience with Java, therefore I cannot say whether Java does the job better...

Goodbye, Thorsten
This archive was generated by hypermail 2b29 : Tue Jan 01 2002 - 17:50:46 MET