Re: [ROOT] TTree modification

From: Anton Fokin (anton.fokin@smartquant.com)
Date: Sun Feb 25 2001 - 23:08:55 MET


Hi,

> I'd like to point out a few things. First of all, there is a principal
> difference between the I/O methods used by "DB-like" and "non-DB-like"
> applications and one of the consequences of this difference is that the
> latter can achieve much higher I/O bandwidth per process.

In the good old days of DOS and CP/M I was involved in low-level database
design. From that experience I would say that the highest I/O performance
can be achieved if you:

1. Write and read fixed-size binary records.
2. Do not provide insert functionality; provide replace functionality
   instead.
3. Delete records by setting a "deleted" flag, and clean up (rewrite) the
   database during idle (night) time.
4. Do format c: before installing the database, so that the heads do not
   jump between cylinders on the HD.
5. Provide smart buffering/caching which adapts the (system) I/O buffer to
   the record size and user queries.
6. Provide smart indexing (hashing for string fields).

This lets you read and write data at (or near) your system I/O speed, which
can be much higher than 25-30 MB/sec on modern SCSI devices.
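
To make points 1-3 concrete, here is a minimal sketch in Python (not ROOT
code, and not the original DOS implementation). The record layout, the class
name FixedRecordFile and its method names are all assumptions made up for
illustration. Because every record has the same size, record i always lives
at byte offset i * RECORD.size, so replace and delete are a single seek plus
a write, and nothing ever has to be shifted.

```python
import io
import struct

# Hypothetical record layout (an assumption for illustration):
# 1-byte "deleted" flag, 4-byte int id, 8-byte double value.
# "<" means little-endian with no padding, so every record is 13 bytes.
RECORD = struct.Struct("<Bid")


class FixedRecordFile:
    """Append / replace / flag-delete store over a seekable binary stream."""

    def __init__(self, stream):
        self.stream = stream

    def count(self):
        # Number of record slots, including those flagged as deleted.
        self.stream.seek(0, io.SEEK_END)
        return self.stream.tell() // RECORD.size

    def append(self, rec_id, value):
        index = self.count()  # count() leaves the position at end of file
        self.stream.write(RECORD.pack(0, rec_id, value))
        return index

    def read(self, index):
        # Fixed size means the offset is a plain multiplication.
        self.stream.seek(index * RECORD.size)
        deleted, rec_id, value = RECORD.unpack(self.stream.read(RECORD.size))
        return None if deleted else (rec_id, value)

    def replace(self, index, rec_id, value):
        # Point 2: overwrite in place instead of inserting.
        self.stream.seek(index * RECORD.size)
        self.stream.write(RECORD.pack(0, rec_id, value))

    def delete(self, index):
        # Point 3: only the flag byte (first byte of the record) is touched.
        self.stream.seek(index * RECORD.size)
        self.stream.write(b"\x01")

    def compact(self):
        # Point 3, the "night" job: rewrite the file without deleted records.
        live = [r for r in (self.read(i) for i in range(self.count()))
                if r is not None]
        self.stream.seek(0)
        self.stream.truncate()
        for rec_id, value in live:
            self.stream.write(RECORD.pack(0, rec_id, value))
        return len(live)
```

The store takes any seekable binary stream, for example io.BytesIO() for a
quick test or open(path, "r+b") for an existing file. Buffering and indexing
(points 5 and 6) are left out of the sketch.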

> The I/O performance currently is one of the key issues for HEP
> experiments. For example, CDF already had to modify the ROOT
> object-oriented I/O mechanism when writing the TTrees out of the DAQ to be
> able to achieve a rate of about 25-30 MB/sec per process (the default I/O
> doesn't provide such a rate), and this is what defines the overall data
> logging rate for us now.

I think that ROOT Trees are much heavier than the scheme in points 1-6
above. That is the reason for your modification.

> ROOT I/O allows one to write objects into a file and to modify/delete them
> after they have been written. TTree is just one of many objects ROOT can
> write out. Let people correct me if I'm wrong, but as far as I can tell,
> TTree is a very specialized container, designed to optimize the I/O
> performance for the objects stored in it, and the assumption that the
> TTree object is not going to be modified, only appended, seems to be quite
> important for this optimization. Therefore, I'd be extremely cautious
> about making any changes to the design of TTree which could have an impact
> on the performance.

The object serialization mechanism which we use in ROOT was initially
developed by Borland for TurboVision and TurboPascal 5.5-6.0 somewhere
around 1985-90. I do not think anybody considers this mechanism suitable
for real database work.

Unfortunately, TTree is the only database-like container in ROOT. ROOT
doesn't have a hierarchy of data storage classes which provide different
functionality at different levels. For example, if I write only fixed-size
records with several binary fields, I do not need 80% of TTree's
functionality. Thus I would guess I could gain xx% in I/O performance by
providing a class for this specific case. At the same time I would like to
use TTree-like query/drawing, so I would like the same (virtual) user
interface for all database classes.
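
The idea of one (virtual) interface over several storage classes can be
sketched roughly as follows, again in Python rather than ROOT's C++. None of
these names exist in ROOT: RecordStore, ListStore, PackedStore, and the
"channel"/"energy" fields are hypothetical. The query and histogram code is
written once in the base class and relies only on len() and get(), so a
lightweight fixed-size store gets the same queries as a general one.

```python
import struct
from abc import ABC, abstractmethod


class RecordStore(ABC):
    """Hypothetical common interface for storage classes (not a ROOT class)."""

    fields = ()

    @abstractmethod
    def __len__(self):
        """Number of records in the store."""

    @abstractmethod
    def get(self, index):
        """Return record `index` as a dict keyed by field name."""

    def select(self, predicate):
        # Query written once, shared by every concrete store.
        return [rec for rec in (self.get(i) for i in range(len(self)))
                if predicate(rec)]

    def histogram(self, field, nbins, low, high):
        # Equal-width bins over [low, high); values outside are ignored.
        counts = [0] * nbins
        width = (high - low) / nbins
        for i in range(len(self)):
            x = self.get(i)[field]
            if low <= x < high:
                counts[min(int((x - low) / width), nbins - 1)] += 1
        return counts


class ListStore(RecordStore):
    """General store: arbitrary Python objects, no fixed layout."""

    fields = ("channel", "energy")

    def __init__(self):
        self._rows = []

    def add(self, channel, energy):
        self._rows.append({"channel": channel, "energy": energy})

    def __len__(self):
        return len(self._rows)

    def get(self, index):
        return self._rows[index]


class PackedStore(RecordStore):
    """Lightweight store: fixed-size 6-byte records in one byte buffer."""

    fields = ("channel", "energy")
    _layout = struct.Struct("<Hf")  # 2-byte unsigned int, 4-byte float

    def __init__(self):
        self._data = bytearray()

    def add(self, channel, energy):
        self._data += self._layout.pack(channel, energy)

    def __len__(self):
        return len(self._data) // self._layout.size

    def get(self, index):
        values = self._layout.unpack_from(self._data,
                                          index * self._layout.size)
        return dict(zip(self.fields, values))
```

With this split, store.histogram("energy", nbins, low, high) and
store.select(...) behave the same way for either class; only the storage
layer differs. Note that PackedStore keeps energies as 32-bit floats, so
values may be rounded on the way in.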

> Definitely, having additional DB-oriented capabilities in ROOT would be
> nice. However a question of whether these capabilities should be provided
> by modifying the TTree or by implementing a different kind of data
> container is an open one.

This is exactly what I asked. If nobody needs these features in TTree, I
would like to write my own storage class for my project. I have also noticed
that ROOT doesn't work well with small events of a few tens of bytes. Thus I
think it should be stated quite clearly in what field ROOT is supposed to be
used. Operating with hundreds of 10-100 MB HEP events is quite different
from operating with millions of 100-byte spectroscopy events.

> I'd also like to comment on another issue. I know that there are a lot of
> requests to the ROOT team coming from the HEP experiments, whose
> implementation requires significant resources. For example, PROOF-server
> is a long-awaited project. The implementation of the specialized ROOT
> client-server utility to minimize the traffic over the net when running
> ROOT on a remote node gives another example. Full integration of the
> "TBuffer-exchange" (fast I/O) mode into ROOT is yet another one. CDF has
> requested this mode and is using it for the data-taking, and I believe
> that the next generation of experiments will depend on this mode even more
> strongly. Taking into consideration the actual resources of the ROOT team,
> I think that we need to have well-specified priorities.

I do not want to start any kind of flame war, but could you tell me why the
ROOT team consists of only two people if it serves experiments with annual
budgets in the billions? My long research experience tells me that
scientific organizations have very inefficient management. Is it a kind of
game? Just for fun, look at the "future plans" on the ROOT site (last
updated in '95 or so) and compare them with the present ROOT status. If ROOT
took on one or two more people in permanent positions, all these plans could
come true.

Regards,
Anton


