[ROOT] TTree: broken disk-write error handling

From: Konstantin Olchanski (olchansk@sam.triumf.ca)
Date: Wed May 26 2004 - 01:35:26 MEST


Rooters- in a production environment, our application is writing ROOT
TTree files into an unreliable storage array (NFS to GPFS to FiberChannel
to RAID-whatever). Any disk write errors (intermittent disk full,
intermittent I/O errors, etc) corrupt the output ROOT file, so we
want to catch the errors and stop the application.

It turns out that catching the disk errors while writing a ROOT TTree
file is not simple. The TTree->AutoSave() and TTree->Fill() are "void"
and do not return success or failure status.

One can check TFile->TestBits(kWriteError), but some write
errors corrupt the output file without ever calling file->SetBit(kWriteError).

For example, consider this stack trace: (ROOT v3.10.2)

#0  0x420d6fb0 in write () from /lib/i686/libc.so.6
#1  0x40102661 in TFile::SysWrite(int, void const*, int) (this=0x8896098, fd=143231640, buf=0xe8, len=-1) at base/src/TFile.cxx:2019
#2  0x40100630 in TFile::WriteBuffer(char const*, int) (this=0x8896098, buf=0x8898a98 "", len=232) at base/src/TFile.cxx:1466
#3  0x4010733f in TKey::WriteFile(int) (this=0x8896ee0, cycle=0) at base/src/TKey.cxx:762
#4  0x40113264 in TObject::Write(char const*, int, int) (this=0x8896c40, name=0x0, option=0, bufsize=0) at base/src/TObject.cxx:889
#5  0x40ade893 in TTree::AutoSave(char const*) (this=0x8896c40, option=0x80494b5 "") at tree/src/TTree.cxx:685
#6  0x08049039 in savetree_ () at tree.cpp:156

If "write()" does not write all the data (the disk is full),
it returns a short count to TKey::WriteFile(). There, the error condition is
ignored, with the output file corrupted, and with "kWriteError" not
flagged.

If the disk error is intermittent and goes away by the time we want to
write something again (or if write() always returns a short count rather
than -1), we get undetectable output file corruption.
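
To illustrate the short-count case: write() treats a partial write as
success, so the caller has to compare the returned count against the
requested length. Below is a minimal sketch of that check in plain POSIX
C++, not ROOT code. The helper name write_all() is hypothetical, and
retrying the remainder is one possible policy (an assumption on my part):
if the disk is still full, the retry fails with -1 and errno set, which
turns a silent short count into a visible error.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper, not part of ROOT: write exactly len bytes or fail.
// Returns true only if every byte was written; on failure errno is set.
bool write_all(int fd, const char* buf, std::size_t len)
{
    std::size_t done = 0;
    while (done < len) {
        ssize_t n = ::write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;    // interrupted before writing, retry
            return false;                    // hard error, errno describes it
        }
        if (n == 0) {                        // no progress: report, do not spin
            errno = EIO;
            return false;
        }
        done += static_cast<std::size_t>(n); // short count: write the remainder
    }
    return true;
}
```

A caller would stop the job as soon as write_all() returns false, instead
of continuing to append to a file that already has a hole in it.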

We do get "error writing all requested bytes to file %s, wrote %d of %d"
error messages on stderr, but the application cannot see them and continues
chewing up CPU-hours writing an unreadable output file.

Even if TKey::WriteFile() were to propagate the error condition,
it would again be ignored in TTree::AutoSave() (wOK=Write(), wOK is not used),
leaving the output file corrupted but "kWriteError" not flagged.

I did not check if similar problems exist in the TTree->Fill() path.

Ideally, TTree->AutoSave() and TTree->Fill() should return an error
status. Otherwise, we could detect the error and set the TFile::kWriteError
bit in TKey::WriteFile() and elsewhere.
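
A rough sketch of the two options, using made-up stand-in classes rather
than the real ROOT sources: Sink, SaveRecord(), Grow() and HasWriteError()
are all hypothetical names. The lowest level sets a sticky flag on any
shortfall (the role I have in mind for kWriteError), and the level above
returns a status instead of void. The flag stays set even if the disk
problem goes away before the next write.

```cpp
#include <cstddef>

// Hypothetical stand-ins for illustration only; these are not ROOT classes.
class Sink {
public:
    explicit Sink(std::size_t capacity) : fFree(capacity) {}

    // Lowest level: accepts at most the free space, like a short write().
    // Any shortfall sets a sticky error flag that is never cleared.
    std::size_t WriteBuffer(const char* /*buf*/, std::size_t len)
    {
        std::size_t n = len < fFree ? len : fFree;
        fFree -= n;
        if (n != len) fWriteError = true;
        return n;
    }

    // Simulates the intermittent condition going away (space freed).
    void Grow(std::size_t extra) { fFree += extra; }

    bool HasWriteError() const { return fWriteError; }

private:
    std::size_t fFree;
    bool fWriteError = false;
};

// Mid level: returns a status instead of void, so callers can stop.
bool SaveRecord(Sink& sink, const char* data, std::size_t len)
{
    return sink.WriteBuffer(data, len) == len;
}
```

With both in place, the application can either test each return value or
poll the flag once per event; a later successful write does not hide an
earlier failure.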

Any thoughts?
Should I try to come up with a patch for flagging file->SetBit(kWriteError)?

-- 
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada


