Re: Tree compression Qs (mmzip error messages)

From: Rene Brun (Rene.Brun@cern.ch)
Date: Tue May 18 1999 - 12:58:24 MEST


Hi Peter,
Thanks for reporting this interesting example.
The mmzip messages are fatal in your case.
In the case of your branches c4, c8 and c12, the compressed buffer is
larger (!) than the uncompressed buffer. In this pathological case, the
compressed buffer was not written to the file, which explains the abnormal
compression factor.
I have now introduced a protection (in TBasket::WriteBuffer). When I
detect this kind of anomaly, I write the uncompressed buffer instead.
I tested your macro with this new version and I now get the following
compression factors (always with compression level=2):
 tree1 cx = 2.78
 tree2 cx = 2.09
 tree3 cx = 1.74

This shows that splitting always improves the compression factor. This is
easy to understand: in split mode, each buffer contains homogeneous data,
which is easier to compress.
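
A minimal standalone C++ sketch of why per-byte buffers are homogeneous. It
does not use ROOT; the function names splitBytePlanes and byteEntropy are
hypothetical, and the Shannon entropy per byte is used only as a rough proxy
for compressibility, not as the quantity the compressor actually computes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of a byte stream in bits per byte
// (0 = every byte identical, 8 = all 256 values equally likely).
double byteEntropy(const std::vector<std::uint8_t>& data) {
    if (data.empty()) return 0.0;
    std::array<std::size_t, 256> counts{};
    for (std::uint8_t b : data) ++counts[b];
    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;
        const double p = static_cast<double>(c) / static_cast<double>(data.size());
        h -= p * std::log2(p);
    }
    return h;
}

// Split 32-bit samples into four byte planes. Plane k collects byte k of
// every sample, with k = 0 the least significant byte. This mimics
// "one branch per byte" by hand, independent of machine byte order.
std::array<std::vector<std::uint8_t>, 4>
splitBytePlanes(const std::vector<std::int32_t>& samples) {
    std::array<std::vector<std::uint8_t>, 4> planes;
    for (std::int32_t s : samples) {
        const std::uint32_t u = static_cast<std::uint32_t>(s);
        for (int k = 0; k < 4; ++k)
            planes[k].push_back(static_cast<std::uint8_t>((u >> (8 * k)) & 0xFFu));
    }
    assert(planes[0].size() == samples.size());
    return planes;
}
```

For small non-negative samples the three high planes are all zero and have
zero entropy, while the low plane carries nearly all the information. For
noise around zero the high planes switch between 0x00 and 0xFF (sign
extension), which is still far more regular than the interleaved 4-byte words.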

When you used larger buffers (128000), the mmzip error did not occur.
More buffers were written to the file, which explains the apparently
larger file size. With my protection, this anomaly also disappears.
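
The idea of the protection can be sketched in standalone C++ as follows. This
is not the TBasket::WriteBuffer code: the toy run-length encoder rleCompress
merely stands in for the real compressor, and StoredBuffer and packBuffer are
hypothetical names. The only point illustrated is the size check that falls
back to the raw buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Toy run-length encoder: emits (count, value) pairs with count in 1..255.
// Stand-in for the real compressor; on data without runs it doubles the size.
Bytes rleCompress(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        assert(run >= 1 && run <= 255);
        out.push_back(static_cast<std::uint8_t>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

struct StoredBuffer {
    bool compressed;  // false: payload is the raw, uncompressed input
    Bytes payload;
};

// Keep the compressed form only when it is strictly smaller than the input;
// otherwise store the input unchanged so that no data is lost.
StoredBuffer packBuffer(const Bytes& raw) {
    Bytes packed = rleCompress(raw);
    if (packed.size() < raw.size()) return {true, std::move(packed)};
    return {false, raw};
}
```

A reader of such a buffer needs the compressed flag (or an equivalent size
comparison) to know whether the payload must be decompressed.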

Rene Brun


Peter Lipa wrote:
> 
> Dear Rene,
> below is a macro that writes 3 trees of a white noise time series. I try
> to find the most compressible format, since in our lab we produce several
> hours/day of parallel recordings of neural firing patterns sampled at
> very high frequencies (up to 1 MHz). Most of the time the samples
> fluctuate around 0 with noise, interrupted by spikes and bursts.
> If I make a branch for EACH BYTE of a time series (e.g. 4 branches for a
> signed Int_t signal) I find that I get compression ratios up to a factor
> 10, compared to a factor 1.5-2 for 1 branch per Int_t.
> This is very valuable for us! (We don't care about the speed loss in
> reading the data.)
> 
> In the macro below I have 4 white noise time series with variances from
> (10000,....,10). See for yourself how the compression ratios work out.
> However, I ALWAYS get these annoying mmzip error messages and I don't
> know if it is just a warning (in which case I would like to turn the
> warning off in production code) or if there is something going wrong!
> (Reading the tree with e.g. your wonderful TTree::MakeClass() skeleton
> works fine and all data seem to be there - I checked only a few though.)
> 
> When I increase the buffsize of tree 1 (t1) to 128000 (we have plenty of
> memory) so that the buffsize is larger than the one-byte time series
> (with 100000 samples), the TTree::Print() info does not report any
> written buffers (as you mentioned in your reply), but ALSO the file size
> of the resulting root file appears to be much, much larger (compared to
> buffsize 16000, for example), so that I conclude that NO FULL compression
> was applied (or at least not to all branches). You can convince yourself
> by commenting out the t2->Fill() and t3->Fill() lines and comparing runs
> with only t1 (bufsiz=16000 and bufsiz=128000).
> Specifically, I get in the above cases (for tree t1 only!):
> CompressionLevel(0) :  bufsiz=16000    actual file size 2778 kB
> CompressionLevel(2) :  bufsiz=16000    actual file size  436 kB  (factor 6.3)
> CompressionLevel(2) :  bufsiz=128000   actual file size  722 kB  (factor 3.8 only)
> 
> So your reply:
> > The buffer however will be compressed when you save the Tree header on
> > the file.
> doesn't seem to apply FULLY in this case (or do I need to save the tree
> header explicitly with some command? I would assume that TFile::Close() does
> that automatically...)
> 
> In case more people want to compress time series, would it be a bad idea
> to create a branch option Split=2 that automatically splits the fields of
> a class into one branch PER byte?
> (Of course, it can always be done by hand, but it took me quite some
> trial and error to figure it out.)
> Compression ratios of 5-10 are a STRONG argument for adopting ROOT for
> many labs, I would think...
> 
> Thanks a lot,
> Best regards,
> Peter
> 
> below the test macro (also in attachment ...)
> ----------------------------------------------------------------------------
> void treetest(int nmax = 100000){
>   // macro to test 3 methods to write a time series to a root file.
>   // Tree 1 has a branch for EACH BYTE of struct ts (i.e. 20 branches).
>   // Tree 2 has one branch for each Int_t word of ts (i.e. 5 branches).
>   // Tree 3 has one branch for the whole ts struct only.
>   //
>   // NOTE: the idea (hope) for tree 1 is that time series often
>   // fluctuate around some mean values (e.g. zero in the white noise
>   // case below) and the high bytes are mostly zero since outliers
>   // happen rarely. Those bytes should be highly compressible!
>   // This is extremely true for time series recorded from neuronal firing
>   // patterns; those fluctuate mostly around zero (with noise)
>   // and produce a spike only every 10-1000 msec.
>   //
>   // Results: tree 1 compresses by a factor 4.6 & produces mmzip error messages
>   //          tree 2                 factor 2   (no mmzip errors)
>   //          tree 3                 factor 1.75 (no mmzip errors)
> 
>   struct{
>     Int_t t;     // time stamps
>     Int_t w[4];  // waveform data
>   } ts;          // time series data point
> 
>   // use a union inside a struct for accessing the bytes of tmpT.l,tmpW[].l
>   struct{
>     union {
>       char c[4];
>       Int_t l;   // Int_t rather than long, so the union matches the 4 bytes of c[4]
>     };
>   } tmpT, tmpW[4];
> 
>   struct{
>     Char_t c;
>   } ch[20]; // struct for accessing the bytes of ts directly
> 
>   // OUT file
>   TFile f1("treetest.root","RECREATE","Test of root trees");
>   f1.SetCompressionLevel(0);
> 
>   // tree with 20 branches; one for each byte of tmpT, tmpW[4]
>   TTree *t1 = new TTree("T1"," Compressed bytes");
>   Int_t bufsz = 16000;        // tried 8000, ...,64000 - still get mmzip errors
>                               // with bufsz=128000, get no mmzip errors, but
>                               // also NO obvious compression measured by
>                               // actual .root file size!
>   char index[8];              // writable buffer for sprintf (a string literal must not be modified)
>   TString bbase = "c";        // branch name base
>   TString bname, leafname;
>   for(Int_t j=0; j<20; j++){
>     sprintf(index,"%d",j);              // make j to zstring
>     bname = bbase + index;              // append index to bname base
>     leafname = bname + "/B1";           // append leaftype+size
>     t1->Branch(bname.Data(),&ch[j],leafname.Data(),bufsz);
>   }
> 
>   // tree with 5 branches; one for each tmpT, tmpW[4] struct
>   TTree *t2 = new TTree("T2"," Compressed structs tmpT, tmpW[4]");
>   bufsz = 64000;
>   bbase = "l";
>   for(Int_t j=0; j<4; j++){
>     sprintf(index,"%d",j);              // make j to zstring
>     bname = bbase + index;              // append index to bname base
>     leafname = bname + "/I";            // append leaftype+size
>     t2->Branch(bname.Data(),&tmpW[j],leafname.Data(),bufsz);
>   }
>   t2->Branch("tmpT",&tmpT,"l/I",bufsz);
> 
> 
>   // tree with 1 branch; one for all struct ts
>   TTree *t3 = new TTree("T3"," Compressed struct ts");
>   bufsz = 64000;
>   t3->Branch("ts",&ts,"t/I:w[4]/I",bufsz);
> 
> 
>   // create random numb generator object for white noise time series
>   TRandom rndm;
> 
>   //"event" loop
>   for (Int_t i=0; i<nmax; i++){
> 
>     // fill union tree with 4+1 dim white noise time series
>     tmpT.l    = i;                          // fill union
>     tmpW[0].l = (Int_t)rndm.Gaus(0,10000);   // fill union
>     tmpW[1].l = (Int_t)rndm.Gaus(0, 1000);   // fill union
>     tmpW[2].l = (Int_t)rndm.Gaus(0,  100);   // fill union
>     tmpW[3].l = (Int_t)rndm.Gaus(0,   10);   // fill union
> 
>     // copy tmpT, tmpW to ts and  ch[].c for filling t3 and t1
>     ts.t = tmpT.l;
>     for (Int_t j=0; j<4; j++){
>       ts.w[j] = tmpW[j].l;
>       ch[j].c = tmpT.c[j];
>       for (Int_t k=0; k<4; k++)
>         ch[4+4*j+k].c = tmpW[j].c[k];
>     }
> 
>     // fill trees
>     t1->Fill();
>     t2->Fill();
>     t3->Fill();
> 
>     // give some diagnostic info
>     if(!(i%10000)){
>       printf("i=%d l=%d ;\ntmpTc[0-3]=%d %d %d %d; ch[0-3]=%d %d %d %d\n",
>           i, tmpT.l, tmpT.c[0], tmpT.c[1], tmpT.c[2], tmpT.c[3],
>           ch[0].c, ch[1].c , ch[2].c, ch[3].c);
>       printf("tmpW[0]=%d\n tmpW[0]c[0-3]=%d %d %d %d; ch[4-7]=%d %d %d %d\n",
>           tmpW[0].l, tmpW[0].c[0], tmpW[0].c[1], tmpW[0].c[2], tmpW[0].c[3],
>           ch[4].c, ch[5].c , ch[6].c, ch[7].c);
>     }
>   } // end "event" loop
> 
>   f1.Write();
> 
>   printf("\n");
>   t1->Print();
>   printf("\n");
>   t2->Print();
>   printf("\n");
>   t3->Print();
> 
>   f1.Close();
> 
> }
> 
> ----- Original Message -----
> From: Rene Brun <Rene.Brun@cern.ch>
> To: Peter Lipa <lipa@nsma.arizona.edu>
> Cc: <roottalk@hpsalo.cern.ch>
> Sent: Monday, May 17, 1999 3:05 AM
> Subject: Re: Tree compression Qs
> 
> > Peter Lipa wrote:
> > >
> > > Hi root talkers,
> > >
> > > What does the error message:
> > > "mmzip: output buffer too small for in-memory compression"
> > > mean?
> > >
> > > I get it when filling a tree with several separate branches.
> > > Increasing the buffer size (from default 32000) to 128000 did not help.
> > >
> > > What do I have to do to get rid of this message and to compress
> > > properly?
> > >
> >
> > Could you send me an example of small macro reproducing this problem?
> >
> >
> > > Also, I noticed when the buffsize is bigger than the whole event series
> > > (so that only one buffer(Basket)  per branch is used) the branch is
> > > NOT compressed at all. Is that intentional?
> > >
> >
> > The current buffer in memory is not reported to be compressed by
> > TTree::Print.
> > The buffer however will be compressed when you save the Tree header on
> > the file.
> >
> > > I also wondered what the compression levels 3-9 mean??
> > > I found nothing in the docu and can't see any noticeable effect.
> > > So why are they implemented and when/how to use them?
> > >
> >
> > Compression levels 3->9 show little gain in space in general (this is
> > data dependent). In particular, with Root trees, when branches contain
> > homogeneous data, the higher compression levels will not give
> > additional advantages. They just take more time.
> >
> > Rene Brun
> >
> 
>   ------------------------------------------------------------------------
> 
>                     Name: treetest.C
>    treetest.C       Type: unspecified type (application/octet-stream)
>                 Encoding: quoted-printable


