RootTalk


ROOT Discussion Forums

PROOF sub-merging bug?

Discuss PROOF, the Parallel ROOT Facility, here.


Post by bbutler » Fri Apr 22, 2011 0:35

Hi Rooters (PROOFers?),

Here at SLAC we have a small PROOF cluster (8 machines, 98 cores). When the number of histograms gets large, I have been dealing with semi-random worker crashes as well as crashes during a very slow merging process. On the theory that the slow merging might be part of the issue, Shuwei from BNL suggested enabling sub-merging. Sub-merging seems to work fine with histograms/objects in the top-level directory of the ROOT file, but as soon as I add a TDirectoryFile object to the output file (I am using TProofOutputFile for output), the merge with sub-merging on seg-faults with this type of error:

===========================================================
#5 0x00002ab15eb6fa43 in TDirectoryFile::Get(char const*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libRIO.so
#6 0x00002ab15eb6acf1 in TDirectoryFile::GetDirectory(char const*, bool, char const*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libRIO.so
#7 0x00002ab16053d8e4 in TFileMerger::MergeRecursive(TDirectory*, TList*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#8 0x00002ab16053e27f in TFileMerger::MergeRecursive(TDirectory*, TList*) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#9 0x00002ab16053d042 in TFileMerger::Merge(bool) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#10 0x00002ab160569fd8 in TProofPlayerRemote::MergeOutputFiles() ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#11 0x00002ab16056f287 in TProofPlayerLite::Finalize(bool, bool) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#12 0x00002ab160570195 in TProofPlayerLite::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProofPlayer.so
#13 0x00002ab15f6dc437 in TProofLite::Process(TDSet*, char const*, char const*, long long, long long) ()
from /afs/slac.stanford.edu/g/atlas/packages/root/root_v5.28.00a.Linux-slc5_amd64-gcc4.3/lib/libProof.so
===========================================================

Am I doing something wrong, or is this a bug? We are using ROOT 5.28a.

Thanks,

Bart
bbutler
 
Posts: 46
Joined: Wed May 05, 2010 16:10

Re: PROOF sub-merging bug?

Post by bbutler » Fri Apr 22, 2011 0:39

I should probably note that merging with TDirectoryFile objects works fine with sub-merging off. With sub-merging on, it always crashes, whether or not the TDirectoryFile is empty.

Re: PROOF sub-merging bug?

Post by ganis » Sat Apr 23, 2011 8:14

Hi,

No, it is not a known problem, but sub-merging is still somewhat experimental.
Unfortunately I am not currently at work, but I will have a look as soon as I am back in the middle of next week.

G. Ganis
ganis
 
Posts: 787
Joined: Tue Sep 02, 2003 10:18
Location: CERN

Re: PROOF sub-merging bug?

Post by ganis » Fri Apr 29, 2011 17:26

Hi,

Update:
I think I managed to reproduce the problem, but I have not yet understood what goes wrong.
I hope to have more news soon.

G. Ganis

Re: PROOF sub-merging bug?

Post by bbutler » Sat Apr 30, 2011 0:51

Excellent, thanks for looking into this.

Re: PROOF sub-merging bug?

Post by bbutler » Wed May 11, 2011 13:12

Hello,

Any updates on this?

-Bart

Re: PROOF sub-merging bug?

Post by ganis » Thu May 12, 2011 11:34

Hi,

I am sorry for the delay; I had some other things to follow and could not work much on this last week.
The solution is a bit tricky. It has to do with the fact that for sub-mergers we need a temporary set of intermediate files, and this was not correctly handled (even when there was no crash).
I think I am close to having the fix, and I am confident I will be able to commit it later this afternoon (CET).

G. Ganis
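
The two-stage scheme described in this post (sub-mergers write intermediate files, the master merges those) can be pictured with a toy model. This is a minimal sketch and not ROOT code: `ToyFile`, `mergeInto`, `mergeAll` and `mergeWithSubMergers` are hypothetical names, a "file" is reduced to a set of directory paths plus named counters standing in for histograms, and the round-robin assignment of worker files to sub-mergers is an assumption made only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for an output file (hypothetical, not a ROOT class):
// directory paths (which may be empty) and counters keyed by "dir/name".
struct ToyFile {
    std::set<std::string> dirs;
    std::map<std::string, long> counters;
};

// Add the contents of `src` into `dst`. Directories are carried over even
// when they hold no counters, so empty directories survive the merge.
void mergeInto(ToyFile &dst, const ToyFile &src) {
    dst.dirs.insert(src.dirs.begin(), src.dirs.end());
    for (const auto &kv : src.counters) dst.counters[kv.first] += kv.second;
}

// Single-stage merge: everything is merged in one place.
ToyFile mergeAll(const std::vector<ToyFile> &inputs) {
    ToyFile out;
    for (const auto &f : inputs) mergeInto(out, f);
    return out;
}

// Two-stage merge: each of `nSub` sub-mergers builds one intermediate file
// from its share of the worker files, then the intermediates are merged.
ToyFile mergeWithSubMergers(const std::vector<ToyFile> &workers, std::size_t nSub) {
    if (nSub == 0) nSub = 1;                       // always keep one merger
    std::vector<ToyFile> intermediates(nSub);      // the temporary files
    for (std::size_t i = 0; i < workers.size(); ++i)
        mergeInto(intermediates[i % nSub], workers[i]);
    return mergeAll(intermediates);
}
```

In this toy model the two-stage result is identical to the single-stage one, including for nested and empty directories, which is the property the intermediate files have to preserve.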

Re: PROOF sub-merging bug?

Post by bbutler » Thu May 12, 2011 16:43

Awesome. Is there any potential to have a 5.28a patch for this or would we just have to use the trunk/wait for the next tag?

Re: PROOF sub-merging bug?

Post by ganis » Thu May 12, 2011 17:41

Hi,

I have just uploaded a fix into the trunk and 5-28-00-patches. I hope you will be able to try it.

bbutler wrote: Is there any potential to have a 5.28a patch for this or would we just have to use the trunk/wait for the next tag?


Since it is in 5-28-00-patches, it will be included in the next tag on that branch, i.e. 5-28-00e, which will probably be released in the coming weeks. We do not modify existing tags.

G. Ganis

Re: PROOF sub-merging bug?

Post by bbutler » Fri May 13, 2011 0:23

Understood, that is great. Thanks a lot, and we will certainly try out 5.28e.

Re: PROOF sub-merging bug?

Post by bbutler » Fri Jul 15, 2011 22:36

The problem is certainly resolved for 5-30, thanks.

Re: PROOF sub-merging bug?

Post by bbutler » Fri Jul 15, 2011 22:53

Looking at the code, when the option "ML" is used for merging with sub-mergers, no check is performed to see whether the files are already local, right? I am thinking of situations where multiple worker cores share a disk array: there one would want to skip the useless local copy of an already-local file, but still copy the partially merged files to the master for final merging. Maybe such a check could be added?
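
To make the request concrete, the kind of check meant here might look roughly like the following. This is only a sketch under stated assumptions, not what TProofOutputFile or the mergers actually do: `hostOf` and `needsCopy` are hypothetical helpers, the URL layout `scheme://[user@]host[:port]/path` is assumed, and hosts are compared as plain strings, so aliases or a disk array mounted on several machines would need a stronger test (for example, trying to open the path locally).

```cpp
#include <string>

// Hypothetical helper: extract the host from a URL such as
// "root://user@host:1094//path/file.root". Returns an empty string for
// plain paths and for "file://" URLs that carry no host.
std::string hostOf(const std::string &url) {
    const std::string::size_type scheme = url.find("://");
    if (scheme == std::string::npos) return "";          // plain local path
    const std::string::size_type begin = scheme + 3;
    const std::string::size_type end = url.find_first_of(":/", begin);
    std::string host = url.substr(
        begin, end == std::string::npos ? std::string::npos : end - begin);
    const std::string::size_type at = host.find('@');    // drop "user@"
    if (at != std::string::npos) host = host.substr(at + 1);
    return host;
}

// Hypothetical check: a copy is only needed when the file names a host
// that differs from the host the merger is running on.
bool needsCopy(const std::string &fileUrl, const std::string &mergerHost) {
    const std::string host = hostOf(fileUrl);
    return !host.empty() && host != "localhost" && host != mergerHost;
}
```

With such a check, files produced by workers on the merger's own machine would be merged in place, while partially merged files from other machines would still be copied to the master.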

