Re: Tree compression Qs (mmzip error messages)

From: Peter Lipa (lipa@nsma.arizona.edu)
Date: Mon May 17 1999 - 21:34:56 MEST


Dear Rene,
below is a macro that writes 3 trees of a white noise time series. I am
trying to find the most compressible format, since in our lab we produce
several hours/day of parallel recordings of neural firing patterns sampled
at very high frequencies (up to 1 MHz). Most of the time the samples
fluctuate around 0 with noise, interrupted by spikes and bursts.
If I make a branch for EACH BYTE of a time series (e.g. 4 branches for a
signed Int_t signal) I get compression ratios of up to a factor 10, compared
to a factor 1.5-2 for 1 branch per Int_t.
This is very valuable for us! (We don't care about the speed loss in reading
the data.)
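
To illustrate the byte-per-branch idea outside of ROOT, here is a minimal
sketch in plain C++. The function names splitBytePlanes and fillerFraction
are made up for this example and are not part of ROOT. It splits 32-bit
samples into four byte planes and measures how many bytes of a plane are
0x00 or 0xFF. Note that in two's complement the high bytes of small negative
samples are 0xFF rather than 0x00, so a signal fluctuating around zero gives
high-byte planes that contain almost only these two values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Split 32-bit samples into four byte planes: planes[b][i] holds byte b
// (0 = least significant) of sample i. Shifts are used instead of a union,
// so the result does not depend on the byte order of the machine.
std::vector<std::vector<std::uint8_t>>
splitBytePlanes(const std::vector<std::int32_t>& samples) {
    std::vector<std::vector<std::uint8_t>> planes(
        4, std::vector<std::uint8_t>(samples.size()));
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const std::uint32_t u = static_cast<std::uint32_t>(samples[i]);
        for (int b = 0; b < 4; ++b)
            planes[b][i] = static_cast<std::uint8_t>((u >> (8 * b)) & 0xFFu);
    }
    return planes;
}

// Fraction of bytes in one plane that are pure sign extension (0x00 or 0xFF).
// A value close to 1 suggests the plane should compress very well.
double fillerFraction(const std::vector<std::uint8_t>& plane) {
    if (plane.empty()) return 0.0;
    std::size_t n = 0;
    for (std::uint8_t v : plane)
        if (v == 0x00 || v == 0xFF) ++n;
    return static_cast<double>(n) / static_cast<double>(plane.size());
}
```

Calling splitBytePlanes on a waveform and then fillerFraction on planes[3]
and planes[2] gives a quick estimate of how much the high bytes can gain
from being stored in their own branch.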

In the macro below I have 4 white noise time series with standard deviations
of 10000, 1000, 100 and 10. See for yourself how the compression ratios
work out.
However, I ALWAYS get these annoying mmzip error messages and I don't know
whether they are just warnings (in which case I would like to turn them off
in production code) or whether something is going wrong! (Reading the tree
with e.g. your wonderful TTree::MakeClass() skeleton works fine and all data
seem to be there, though I checked only a few.)

When I increase the buffer size of tree 1 (t1) to 128000 (we have plenty of
memory), so that the buffer is larger than a one-byte time series (with
100000 samples), the TTree::Print() info does not report any written buffers
(as you mentioned in your reply), but ALSO the resulting root file is much
larger (compared to a buffer size of 16000, for example), so I conclude
that NO FULL compression was applied (or at least not to all branches). You
can convince yourself by commenting out the t2->Fill() and t3->Fill() lines
and comparing runs with only t1 (bufsiz=16000 and bufsiz=128000).
Specifically, in the above cases I get (for tree t1 only!):

  CompressionLevel   bufsiz    actual file size   factor
  0                   16000    2778 kB            -
  2                   16000     436 kB            6.3
  2                  128000     722 kB            3.8 only
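
The factors quoted above are simply the uncompressed file size divided by the
compressed one. As a small sketch of that arithmetic (compressionFactor is a
name invented here, not a ROOT call):

```cpp
// Compression factor = size without compression / size with compression.
// Both sizes must be given in the same unit (kB in the figures above).
double compressionFactor(double uncompressedSize, double compressedSize) {
    return uncompressedSize / compressedSize;
}
```

With the sizes above, compressionFactor(2778, 436) is roughly 6.37 and
compressionFactor(2778, 722) is roughly 3.85.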

So your reply:
> The buffer however will be compressed when you save the Tree header on
> the file.
doesn't seem to apply FULLY in this case (or do I need to save the tree
header explicitly with some command? I would assume that TFile::Close() does
that automatically...)

In case more people want to compress time series, would it be a bad idea to
create a branch option Split=2 that automatically splits the fields of a
class into one branch PER byte?
(Of course, it can always be done by hand, but it took me quite some trial
and error to figure it out.)
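
For anyone doing the split by hand, here is a rough sketch in plain C++ of
what such a per-byte split and the matching reassembly on the reading side
could look like. The helpers toBytes and fromBytes and the struct Sample are
hypothetical names for this example only; they are not ROOT functions and
this is not how a Split option would necessarily be implemented. Sample
mirrors the layout of the ts struct in the macro (one time stamp plus four
waveform values).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Copy any trivially copyable object into an array of its raw bytes,
// one byte per would-be branch.
template <typename T>
std::array<unsigned char, sizeof(T)> toBytes(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte splitting needs a trivially copyable type");
    std::array<unsigned char, sizeof(T)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(T));
    return bytes;
}

// Inverse of toBytes: rebuild the object from its raw bytes when reading.
template <typename T>
T fromBytes(const std::array<unsigned char, sizeof(T)>& bytes) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte splitting needs a trivially copyable type");
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}

// Hypothetical data point with the same layout as ts in the macro.
struct Sample {
    std::int32_t t;     // time stamp
    std::int32_t w[4];  // waveform data
};
static_assert(sizeof(Sample) == 20, "expect 20 bytes, i.e. 20 branches");
```

Each element of the array returned by toBytes would go to its own one-byte
branch, and fromBytes<Sample>(bytes) restores the original data point after
reading the 20 bytes back. The byte order inside the array follows the
machine that wrote it, so this sketch assumes writer and reader share the
same byte order.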
Compression ratios 5-10 are a STRONG argument for adopting ROOT for many
labs, I would think...

Thanks a lot,
Best regards,
Peter

Below is the test macro (also in the attachment ...)
----------------------------------------------------------------------------
void treetest(int nmax = 100000){
  // macro to test 3 methods to write a time series to a root file.
  // Tree 1 has a branch for EACH BYTE of struct ts (i.e. 20 branches).
  // Tree 2 has one branch for each Int_t word of ts (i.e. 5 branches).
  // Tree 3 has one branch for the whole ts struct only.
  //
  // NOTE: the idea (hope) for tree 1 is that time series often
  // fluctuate around some mean values (e.g. zero in the white noise
  // case below) and the high bytes are mostly zero since outliers
  // happen rarely. Those bytes should be highly compressible!
  // This is especially true for time series recorded from neuronal firing
  // patterns; those fluctuate mostly around zero (with noise)
  // and produce a spike only every 10-1000 msec.
  //
  // Results: tree 1 compresses by a factor 4.6 & produces mmzip error messages
  //          tree 2                 factor 2   (no mmzip errors)
  //          tree 3                 factor 1.75 (no mmzip errors)

  struct{
    Int_t t;     // time stamps
    Int_t w[4];  // waveform data
  } ts;          // time series data point

  // use a union inside a struct for accessing the bytes of tmpT.l,tmpW[].l
  struct{
    union {
      char c[4];
      Int_t l;   // Int_t (4 bytes) so it overlays c[4] and matches the "/I" leaves
    };
  } tmpT, tmpW[4];


  struct{
    Char_t c; 
  } ch[20]; // struct for accessing the bytes of ts directly
  
  // OUT file 
  TFile f1("treetest.root","RECREATE","Test of root trees");
  f1.SetCompressionLevel(0);
  
  // tree with 20 branches; one for each byte of tmpT, tmpW[4]
  TTree *t1 = new TTree("T1"," Compressed bytes");  
  Int_t bufsz = 16000;        // tried 8000, ...,64000 - still get mmzip errors
                              // with bufsz=128000, get no mmzip errors,
                              // but also NO obvious compression measured by
                              // actual .root file size!
  char index[16];             // writable scratch buffer for sprintf (a string literal is read-only)
  TString bbase = "c";        // branch name base
  TString bname, leafname;
  for(Int_t j=0; j<20; j++){
    sprintf(index,"%d",j);              // make j to zstring
    bname = bbase + index;              // append index to bname base
    leafname = bname + "/B1";           // append leaftype+size 
    t1->Branch(bname.Data(),&ch[j],leafname.Data(),bufsz);
  }

  // tree with 5 branches; one for each tmpT, tmpW[4] struct
  TTree *t2 = new TTree("T2"," Compressed structs tmpT, tmpW[4]");  
  bufsz = 64000;
  bbase = "l";
  for(Int_t j=0; j<4; j++){
    sprintf(index,"%d",j);              // make j to zstring
    bname = bbase + index;              // append index to bname base
    leafname = bname + "/I";            // append leaftype+size 
    t2->Branch(bname.Data(),&tmpW[j],leafname.Data(),bufsz);
  }
  t2->Branch("tmpT",&tmpT,"l/I",bufsz);
  

  // tree with 1 branch; one for all struct ts
  TTree *t3 = new TTree("T3"," Compressed struct ts");  
  bufsz = 64000;
  t3->Branch("ts",&ts,"t/I:w[4]/I",bufsz);
  
  
  // create random numb generator object for white noise time series
  TRandom rndm;             
  
  //"event" loop
  for (Int_t i=0; i<nmax; i++){

    // fill unions with 4+1 dim white noise time series
    tmpT.l    = i;                          // fill union
    tmpW[0].l = (Int_t)rndm.Gaus(0,10000);   // fill union
    tmpW[1].l = (Int_t)rndm.Gaus(0, 1000);   // fill union
    tmpW[2].l = (Int_t)rndm.Gaus(0,  100);   // fill union
    tmpW[3].l = (Int_t)rndm.Gaus(0,   10);   // fill union

    // copy tmpT, tmpW to ts and  ch[].c for filling t3 and t1
    ts.t = tmpT.l;
    for (Int_t j=0; j<4; j++){
      ts.w[j] = tmpW[j].l;
      ch[j].c = tmpT.c[j];
      for (Int_t k=0; k<4; k++)
        ch[4+4*j+k].c = tmpW[j].c[k];
    }

    // fill trees
    t1->Fill();
    t2->Fill();
    t3->Fill();

    // give some diagnostic info
    if(!(i%10000)){
      printf("i=%d l=%d ;\ntmpTc[0-3]=%d %d %d %d; ch[0-3]=%d %d %d %d\n",
          i, tmpT.l, tmpT.c[0], tmpT.c[1], tmpT.c[2], tmpT.c[3],
          ch[0].c, ch[1].c , ch[2].c, ch[3].c);
      printf("tmpW[0]=%d\n tmpW[0]c[0-3]=%d %d %d %d; ch[4-7]=%d %d %d %d\n",
          tmpW[0].l, tmpW[0].c[0], tmpW[0].c[1], tmpW[0].c[2], tmpW[0].c[3],
          ch[4].c, ch[5].c , ch[6].c, ch[7].c);
    }
  } // end "event" loop


  f1.Write();

  printf("\n");
  t1->Print();
  printf("\n");
  t2->Print();
  printf("\n");
  t3->Print();

  f1.Close();

}


----- Original Message -----
From: Rene Brun <Rene.Brun@cern.ch>
To: Peter Lipa <lipa@nsma.arizona.edu>
Cc: <roottalk@hpsalo.cern.ch>
Sent: Monday, May 17, 1999 3:05 AM
Subject: Re: Tree compression Qs


> Peter Lipa wrote:
> >
> > Hi root talkers,
> >
> > What does the error message:
> > "mmzip: output buffer too small for in-memory compression"
> > mean?
> >
> > I get it when filling a tree with several separate branches.
> > Increasing the buffer size (from default 32000) to 128000 did not help.
> >
> > What do I have to do to get rid of this message and to compress
> > properly?
> >
>
> Could you send me an example of small macro reproducing this problem?
>
>
> > Also, I noticed when the buffsize is bigger than the whole event series
> > (so that only one buffer(Basket)  per branch is used) the branch is
> > NOT compressed at all. Is that intentional?
> >
>
> The current buffer in memory is not reported to be compressed by
> TTree::Print.
> The buffer however will be compressed when you save the Tree header on
> the file.
>
> > I also wondered what the compression levels 3-9 mean??
> > I found nothing in the docu and can't see any noticeable effect.
> > So why are they implemented and when/how to use them?
> >
>
> Compression levels 3->9 show little gain in space in general (this is
> data dependent). In particular, with Root trees, when branches contain
> homogeneous data, the higher compression levels will not give
> additional advantages. They just take more time.
>
> Rene Brun
>




