Do we need yet another custom C++ interpreter?

Hi,

"A ROOT User" asks "Is it really necessary to replace CINT dictionary with cling?", bringing up very reasonable concerns and arguments against re-implementing CINT. I will try to answer his comments to clarify why we do it, and how it connects with the rest.

A fundamental misconception is that the status quo is acceptable. It is not, for several reasons.

  1. CINT vs C++
    CINT was designed (20 years ago!) to be a C interpreter; C++ support was added later. It still has many shortcomings with C++ 2003, let alone C++11.
  2. CINT maintenance
    The original author of CINT, Masaharu Goto, has moved on; CINT has been maintained mainly by the ROOT team. It has 300k lines of code; that's a considerable fraction of ROOT's 2.5MLOC. It was designed to fit into the integrated processing units of appliances (like medical ones) - not for environments with 16GB RAM, 8 compute threads and 50000 classes.
  3. Reflex and GCCXML solve it
    ATLAS, CMS and LHCb use GCCXML to parse their headers, a set of python scripts to parse the generated XML file and write a C++ source file, the Reflex dictionary, which then gets compiled, linked, loaded, its data injected into the Reflex reflection database, which then gets copied through Cintex into CINT. We have thus many duplications of strings (three in the worst case with Reflex) and conflicts between duplicate dictionaries in Reflex versus CINT (famous: "std::map<std::string, TH1*>" must not be described through Reflex). On top of that, GCCXML is a limited parser (e.g. it swallows typedefs in certain conditions, think Double32_t); as it uses the GCC parser this will not be fixed. I.e. the current setup is fragile, inefficient, and limiting.
  4. CINT is not relevant, I use PyROOT
    For calling into C++, PyROOT relies on CINT's reflection data from ROOT (which is why it's so fantastic compared to static SWIG-based approaches). And ROOT relies on CINT for I/O, both the dictionaries and the interpreter. I.e. you use CINT much, much more than you think: it's not just the prompt, it is at the core of most of ROOT.

So we need to do something. C++ interpreters are extremely rare. Instead of rewriting a C++ interpreter we decided to reuse existing code. Code that we can still influence, but that's nevertheless production-grade. We expect that this will solve the maintenance and correctness issues. And because it's correct we don't need Reflex, but can instead use one central, fast (compiler!) reflection database.

So yes, this is a major overhaul of ROOT and the dictionaries. We will signal that with a new major ROOT version number. But we expect it to solve the correctness, stability, memory and CPU-consumption as well as the maintenance issues we currently have. The current implementation of cling (which is not yet complete) uses a mere 5000 lines of custom code developed by HEP; everything else is provided through LLVM and clang.

And regarding PyROOT: I am sure Wim will make good use of the new JIT power that comes with cling! Just like we expect the JIT to leave traces e.g. in TFormula, and the real reflection database in the I/O, THtml etc. It gets us unstuck, flexible and future-safe in many central areas of ROOT. Oh, the places you'll go!

Cheers,
Axel

Other Python bindings

I noticed that there are multiple other ways to call C++ code from Python, one of them being included in the Boost library. What would it do to the complexity (and dependencies) to use an interface that doesn't build on top of CINT/Reflex?

Re: Other Python bindings

Hi Bram,

Thanks for your question. The main issue about the boost binding is that it is - as far as I understand - completely static and intrusive. PyROOT on the other hand is based on reflection data, and it has features that e.g. the Boost binding doesn't offer (e.g. the mapping of concepts). Other bindings (e.g. SWIG-based ones) are difficult to maintain, not compatible with C++, and don't offer PyROOT's features either. So the cost is both on the implementation side and the feature side. Thus why not simply use PyROOT? :-)

Note that we will soon have a PyROOT that builds on top of clang, as part of ROOT 6. I think Wim (the author of PyROOT) plans to port it to a version without ROOT, likely involving PyPy. So that might be exactly what you are looking for :-)

Cheers, Axel.

Re: Other Python bindings

Hi Bram, Axel,

let me add to that (and point out that none of the mentioned tools are intrusive, btw.). The biggest problem with boost.python (with pyste; standalone it is a non-starter) and SWIG is that you need to run a separate tool to create and compile bindings. On top, these bindings are compiled against a specific version of Python, making for a distribution headache (just see the non-pickup of Python3 because of this problem). Compare: dictionaries are already available for all the most important classes in experiments, the EDM, because they are generated for I/O needs. They also do not depend on Python, and thus not on any specific version (only PyROOT does). Besides the obvious ease of use, there is also the benefit of lower memory footprints by not replicating structures. (For that matter, PyROOT creates bindings lazily, the others do not.)

Other problems we've had, are that boost.python is very, very slow and only in "keeping alive" mode since 2004 or so. Pyste is based on gccxml, so no C++11 there, and has seen no major updates since 2005. SWIG is much, much better in both regards, but not up to snuff: it plain and simply can not parse our header files. The way around that, is to write .i files, but as you can imagine, that duplication is not nice for maintenance. Worse, the developers of individual packages need to do this work, and not every C++ developer has Python, let alone SWIG, experience.

Then there's PyPy. All existing binding generator tools (including PyROOT) rely on CPython internals, or at least on the Python C-API. That does not jibe with PyPy as it has for example a garbage collector instead of reference counting. Through some heroics, it does expose a Python C-API, but it's slow as it interferes with (blocks, really) the just-in-time compiler. Therefore, within PyPy, there are two new approaches: cffi for C and cppyy for C++. Both are part of the standard PyPy releases. There is also already a PyROOT version for the latter (see: http://root.cern.ch/drupal/content/pypyroot).

Cheers,
Wim

Why?

I don't understand. You wish to maintain backwards compatibility. This implies maintaining the insanity that is the equivalence of "." and "->". Not only is this wrong, this egregiously ignores performance concerns that come with dereferencing. It also ensures that people using ROOT/Cling while learning C++ will have trouble compiling their programs using actual compilers. It implies that you intend to keep the (at best) insane class hierarchy TF1 <- TF2 <- TF3 and so on. This example shows some of the major design flaws in ROOT -- a 2-dimensional function _IS_ a 1-dimensional function? There is no abstract base class? No templates? It implies that you plan to keep the pointless T in front of all the names of ROOT, even though you will have access to namespaces (_finally_) and thus can move past the 1970's C practice of avoiding name collisions by a sort of weird Hungarian notation. It implies that you plan to maintain the outdated interfaces which make no use of templates. Templates are one of the most powerful features of C++, are more relevant to performance critical tasks than inheritance, and help ensure the type-safety of code (thereby ensuring the accuracy of data by helping to prevent accidental narrowing etc). It implies that you intend to continue to encourage the use of bare new and delete operators, instead of relying on the more efficient, reliable, and safe method of using RAII. Why? This begs the obvious question: why bother migrating at all? You wish to migrate to the modern and superior C++11 in order to not take advantage of its features? Why not just simply maintain ROOT5 and CINT, and just refuse to upgrade? ROOT is not a particularly good framework that is written in a language that isn't quite C++. 
If you are going to break away and make/use/write cling, then fix the poor design decisions: cling will probably break compatibility _anyways_ despite your best efforts, so you might as well take the time and effort to refactor and clean up the code base. A simple example, taken from this website:

""" TFFTComplex One of the interface classes to the FFTW package, can be used directly or via the TVirtualFFT class. Only the basic interface of FFTW is implemented. Computes complex input/output discrete Fourier transforms (DFT) in one or more dimensions. For the detailed information on the computed transforms please refer to the FFTW manual, chapter "What FFTW really computes". How to use it: 1) Create an instance of TFFTComplex - this will allocate input and output arrays (unless an in-place transform is specified) 2) Run the Init() function with the desired flags and settings ... """

This is simply poor design. This should look like: root::Fft, removing the T, using a namespace, using templates instead of inheritance, etc. But also, notice that you have to run an Init function. Why? That is specifically what the constructor is for. Why does everything in ROOT know how to draw itself? Why does everything in ROOT have 100 methods, for "quick access" to other objects that do the actual work of those methods? These are questions that should be asked.

But most of all, if you aren't going to fix these problems, why bother migrating at all? You fail to treat this migration as what it actually is. You are migrating to a _new language_, not a new version of a language. ROOT isn't written in C++. It's written in CINT. I am just frustrated to see this happen, because I know this community can do better. Maybe I'll make a draft of some smaller changes that need to be made and submit them to the mailing list. But honestly, I'm not very hopeful about this migration.
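
To make the Init() point above concrete, here is a minimal sketch contrasting two-step initialization with constructor-based setup (RAII). The class names TwoStepTransform and RaiiTransform are hypothetical and only illustrate the two styles; they do not correspond to any ROOT or FFTW interface.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical two-step class: the object exists in an unusable state
// until Init() has been called, so every method has to check for that.
class TwoStepTransform {
public:
    void Init(std::size_t n) {
        buffer_.assign(n, 0.0);
        ready_ = true;
    }
    std::size_t Size() const {
        if (!ready_) throw std::logic_error("Init() was not called");
        return buffer_.size();
    }
private:
    std::vector<double> buffer_;
    bool ready_ = false;
};

// Hypothetical RAII-style class: the constructor acquires the buffer,
// so every live object is usable, and the vector frees its memory
// automatically when the object goes out of scope.
class RaiiTransform {
public:
    explicit RaiiTransform(std::size_t n) : buffer_(n, 0.0) {}
    std::size_t Size() const { return buffer_.size(); }
private:
    std::vector<double> buffer_;
};
```

With the second style, a declaration such as RaiiTransform fft(1024); leaves no window in which the object can be used before it is ready.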

Re: Why?

Hi Matt!

Thanks for your feedback; I'll try to reply to each of your comments one by one. I do not disagree with all of your comments, but I might have explanations for some of them :-) Sometimes you seem to confuse "backward compatibility" (which means "what used to work will continue to work") with "no change" - but that might just have been your motivation to take the time for writing your feedback, so I don't complain :-) Given the relevance of your comments I decided to reply in a separate blog post.

Cheers, Axel

Thank you for the very nice

Thank you for the very nice explanation of CINT vs cling issue. I did not know that Reflex relies on CINT. The proposed upgrade to cling sounds very promising indeed. By the way, we will also need to consider backward compatibility as experiments will still need to read data already recorded in 2010/2011.

Re: Backward Compatibility

Hi ROOT user,

Thanks for your comment! And yes, backward compatibility is key in this area. I will do all I can to reduce the amount of code we need to maintain only for backward compatibility reasons - e.g. Reflex can hopefully be removed instead of being rewired to tap the clang AST (i.e. the cling reflection database). But at the same time we will make sure that all data stored by the experiments remains readable (ideally even from 2001 :-).

This is mostly an issue of type names; CINT has some non-obvious (and non-standard compliant) naming conventions for types, and we must make sure that cling continues to understand them. Or we cannot read an edm::TaggedVector<edm::Jet> anymore (because CINT would have called it an edm::TaggedVector<Jet>).
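
One way to picture the type-name problem is a lookup table from legacy spellings to fully qualified names. The sketch below is only an illustration under that assumption: the class TypeNameAliases and its methods are hypothetical, and nothing here claims to be how cling or ROOT resolves names.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical alias table: legacy (CINT-style) type name -> fully
// qualified name that a standard-compliant front end would expect.
class TypeNameAliases {
public:
    void Add(const std::string& legacy, const std::string& qualified) {
        aliases_[legacy] = qualified;
    }

    // Returns the qualified name if the input is a known legacy spelling;
    // otherwise returns the input unchanged.
    std::string Resolve(const std::string& name) const {
        auto it = aliases_.find(name);
        if (it == aliases_.end()) return name;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> aliases_;
};
```

Registering "edm::TaggedVector<Jet>" as an alias of "edm::TaggedVector<edm::Jet>" would then let both spellings resolve to the same type name when an old file is opened.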

We plan to release a snapshot of ROOT using cling in the third quarter of 2012; we will really appreciate feedback on problems with reading old files - as you correctly pointed out this is one of the most crucial ingredients of this project.

Cheers, Axel.

Thank you for clarifying a

Thank you for clarifying the transition plan, it is quite a reasonable approach. I just want to add a personal request to your wish list. Would it be possible to improve IO speed for reading? Very often analysis code is constrained by CPU/disk access limits when reading ntuples. The speed varies from ~100 kHz for a tree with a few float branches to ~200 Hz for complex data structures. An improvement by a factor of a few for complex data can be the difference between requiring just one machine or a small farm.

Re: I/O Performance

Hi ROOT User,

We have dramatically improved the I/O performance over the last two years. If you use the latest production release also for writing data you might be able to see a performance improvement of an order of magnitude compared to e.g. 5.26, both in real and CPU time! See e.g. this blog entry.

We have been comparing the performance of ROOT I/O with competitors like Google ProtoBuf; we know exactly where we spend extra time and why, e.g. for schema evolution, proper C++ type support, introspection, pointers.

On the other hand, are you sure you make use of all the performance features ROOT offers? Did you enable the tree cache (on by default for PROOF and one tree per file, off - for now, still - otherwise)? Do you only read the branches you need? I am working on a new TTree read access class that should simplify all of that considerably (and is type safe - no more void*&!); maybe I should take your comment as an invitation to speed up :-)

Cheers, Axel.

Thanks again Axel for another

Thanks again Axel for another very nice overview of ROOT features. I definitely have not made use of these recent performance features, this now goes on my to-do list. By the way, I was thinking a bit more about my original post and I would like to try to explain better my point about the need for another interpreter. I was thinking about the following approach:

-- Rely on gcc with Reflex (or any other similar mechanism/compiler) to generate dictionaries for IO and PyROOT. This probably does not require an interpreter.
-- Promote use of PyROOT for analysis macros instead of macros interpreted by CINT.
-- Maintain CINT for legacy reasons and plan an eventual phase out.

I suspect there is a good reason why C/C++ interpreters are rare - it is just not an easy/efficient way to run code. PyROOT is a fantastic tool and it can do everything that CINT can, with the advantage of having access to all python libraries.

Anyway, this is just to clarify my point. I understand that you went through similar arguments and chose the development path best suited to the community.

Re: Interpreters

Hi ROOT user,

Thanks for your comments - they are excellent!

Your scenario would probably work - but we decided against it, and I believe that we have good reasons for that :-)

GCCXML's future is limited; there is a re-write based on GCC's plugin mechanism, but both suffer from the same problems: we cannot influence what the GCC parser does. And reading headers, writing XML, parsing XML, writing (huge files of) C++, compiling, linking, loading - that's really, really inefficient and error prone.

Python is much simpler than C++. But it's still a horrible language in our environment, unless it's used as bash++. Not a single algorithm should be written in Python: it's terribly hard to convert it into C++, and it's incredibly slow in Python (ask the Google developers about youtube).

So C++ is not a good interpreted language, mainly due to its syntactic verbosity and its lack of dynamic interfaces and reflection capabilities - think

    const std::type_info& ti = std::type_info::lookup("MyClass");
    MyBase* ptr = ti.default_construct();

And Python is not appropriate for many use cases due to its lack of type safety and speed, and its lack of native binding to C++. Then which other language should we use?
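
The lookup-by-name snippet above is wishful: std::type_info has no lookup or default_construct members in standard C++. As a rough illustration of what has to be built by hand (or generated as a dictionary) instead, here is a minimal sketch of a name-to-factory registry. TypeRegistry, MyBase and MyClass are hypothetical names used only for this example.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct MyBase {
    virtual ~MyBase() = default;
    virtual std::string Name() const = 0;
};

struct MyClass : MyBase {
    std::string Name() const override { return "MyClass"; }
};

// Hypothetical hand-written registry: class name -> factory function.
// Standard C++ offers no built-in equivalent, so each class has to be
// registered explicitly before it can be constructed by name.
class TypeRegistry {
public:
    using Factory = std::function<std::unique_ptr<MyBase>()>;

    void Register(const std::string& name, Factory factory) {
        factories_[name] = std::move(factory);
    }

    // Returns a default-constructed object, or nullptr for an unknown name.
    std::unique_ptr<MyBase> DefaultConstruct(const std::string& name) const {
        auto it = factories_.find(name);
        if (it == factories_.end()) return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};
```

Every class needs its own Register call here, which is the kind of bookkeeping that a central reflection database is meant to take over.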

Cheers, Axel

Hi Axel, Very good points

Hi Axel,

Very good points but let me try to defend python. I have found that the following approach (used by ATLAS, which I also adopted in my private code) works fantastically well:

-- Use python to read configuration, find input files, etc;
-- Write performance critical code in C++;
-- Create C++ objects in python (relying on ROOT for dictionary support);
-- Pass configuration from python to C++;
-- Do calculations in C++;
-- Return results to python for processing, plotting, etc;
-- Run entire plot making code in python for stacking, labeling, etc.

Granted, this is probably a more complex approach than most of us in physics are willing to tolerate. I suspect that you do not have much choice since the user community wants CINT-like functionality from ROOT (and one feature of the ROOT project that makes it great is a full consideration of what experiments and users need for data taking and analysis).

Thanks for the interesting discussion! I have learned quite a bit about ROOT plans and it all seems very promising. Cheers!

Dependency on Python

In my experience, getting python scripts to work is a very unreliable affair. They almost always have dependencies on external packages and if you don't have EXACTLY the same version of python you only have about a 50% chance that anything you use will work. The language is simply not stable. C++ is bad enough. Scripting languages are much, much worse. The maintainers think the language is their playtoy and they take no responsibility to maintain backward compatibility from release to release. Python is just a Bad Idea(tm).

CINT needs to be communitized, that's the whole problem

Indeed, for decades, Cint never made it to open source because of ROOT dependencies and backward compatibility. Now that I see this argument no longer holds, and we are going to dig up some old grave.. I don't see why Cint shouldn't be taken over by open source or boost. I don't see how we are going to leverage Clang/Cling at all... all I see is that regressions will show up to a far higher degree. I "plussoie" Renee's point, mixing technologies is a very, veRY, VERY bad idea. Mostly because Python isn't an ISO standard like many other languages. We should stick to ISO C++ and that's all we need. Adding a few more features to Cint isn't a big deal.

CINT and Open Source

Hi Daniel,

Thank you for your comment! As a matter of fact, CINT does not depend on ROOT at all. It is open source. It was used in commercial products independently of ROOT. I also don't see where the connection between cling and a python dependence comes in?

Given the amount of work that went into GCC to bring C++11 support I find it unrealistic that we (not compiler people!) would be able to lift CINT to C++11...

Cheers, Axel