Re: [ROOT] TTree: broken disk-write error handling

From: Rene Brun (Rene.Brun@cern.ch)
Date: Wed May 26 2004 - 09:53:47 MEST


Hi Konstantin,

In the CVS version, I modified TFile::WriteBuffer to set the bit kWriteError
in case of errors of the type:
    "error writing all requested bytes to file"

I modified TDirectory::WriteObject to return 0 bytes in case of an error
while writing to the file.

I also modified the signature of TTree::AutoSave to return the number of bytes
written to the file. A return value of 0 indicates an error.

Rene Brun


Konstantin Olchanski wrote:
> 
> Rooters- in a production environment, our application is writing ROOT
> TTree files into an unreliable storage array (NFS to GPFS to FiberChannel
> to RAID-whatever). Any disk write errors (intermittent disk full,
> intermittent I/O errors, etc.) corrupt the output ROOT file, so we
> want to catch the errors and stop the application.
> 
> It turns out that catching the disk errors while writing a ROOT TTree
> file is not simple. The TTree->AutoSave() and TTree->Fill() are "void"
> and do not return success or failure status.
> 
> One can check TFile->TestBits(kWriteError), but some write errors
> corrupt the output file without the kWriteError bit ever being set.
> 
> For example, consider this stack trace: (ROOT v3.10.2)
> 
> #0  0x420d6fb0 in write () from /lib/i686/libc.so.6
> #1  0x40102661 in TFile::SysWrite(int, void const*, int) (this=0x8896098, fd=143231640, buf=0xe8, len=-1) at base/src/TFile.cxx:2019
> #2  0x40100630 in TFile::WriteBuffer(char const*, int) (this=0x8896098, buf=0x8898a98 "", len=232) at base/src/TFile.cxx:1466
> #3  0x4010733f in TKey::WriteFile(int) (this=0x8896ee0, cycle=0) at base/src/TKey.cxx:762
> #4  0x40113264 in TObject::Write(char const*, int, int) (this=0x8896c40, name=0x0, option=0, bufsize=0) at base/src/TObject.cxx:889
> #5  0x40ade893 in TTree::AutoSave(char const*) (this=0x8896c40, option=0x80494b5 "") at tree/src/TTree.cxx:685
> #6  0x08049039 in savetree_ () at tree.cpp:156
> 
> If "write()" does not write all the data (the disk is full),
> it returns a short count to TKey::WriteFile(). There, the error condition is
> ignored, with the output file corrupted, and with "kWriteError" not
> flagged.
> 
> If the disk error is intermittent and goes away by the time we want to
> write something again (or if write() always returns a short count rather
> than -1), we get undetectable output file corruption.
> 
> We do get "error writing all requested bytes to file %s, wrote %d of %d"
> error messages on stderr, but the application cannot see them and
> continues chewing up CPU-hours writing an unreadable output file.
> 
> Even if TKey::WriteFile() were to propagate the error condition,
> it is again ignored in TTree::AutoSave() (wOK=Write(), wOK is not used),
> with output file corrupted but "kWriteError" not flagged.
> 
> I did not check if similar problems exist in the TTree->Fill() path.
> 
> Ideally, TTree->AutoSave() and TTree->Fill() should return an error
> status. Otherwise, we could detect the error and set the TFile::kWriteError
> bit in TKey::WriteFile() and elsewhere.
> 
> Any thoughts?
> Should I try to come up with a patch for flagging file->SetBit(kWriteError)?
> 
> --
> Konstantin Olchanski
> Data Acquisition Systems: The Bytes Must Flow!
> Email: olchansk-at-triumf-dot-ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
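
For reference, the short-write case described in the quoted message (write()
returning fewer bytes than requested instead of -1) is commonly handled by
looping until every byte is written and by making any failure sticky, so that
a later successful write cannot hide it. The following is only a minimal
sketch in plain POSIX C++, not ROOT code: write_all, OutputState and
checked_write are hypothetical names introduced for illustration, and
OutputState::write_error merely stands in for a flag such as kWriteError.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>   // POSIX write(), ssize_t

// Hypothetical helper (not part of ROOT): write exactly len bytes to fd.
// Returns true only if write() accepted every byte.
bool write_all(int fd, const void* buf, std::size_t len)
{
    assert(buf != nullptr || len == 0);
    const char* p = static_cast<const char*>(buf);
    std::size_t remaining = len;
    while (remaining > 0) {
        ssize_t n = ::write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted before any byte was written: retry
            return false;                  // hard error (ENOSPC, EIO, EBADF, ...)
        }
        if (n == 0) return false;          // no progress: fail rather than spin forever
        p += n;                            // short write: advance and write the rest
        remaining -= static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical stand-in for a sticky per-file error flag such as kWriteError.
struct OutputState {
    bool write_error = false;
};

// Once a write has failed, the flag stays set and every later call reports
// failure, so an intermittent error cannot go unnoticed.
bool checked_write(OutputState& st, int fd, const void* buf, std::size_t len)
{
    if (st.write_error) return false;
    if (!write_all(fd, buf, len)) st.write_error = true;
    return !st.write_error;
}
```

An application would call checked_write() for each buffer and stop as soon as
it returns false. Whether a short write should be retried, as here, or
reported immediately is a design choice; either way the caller must be told.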


