gProof->GetManager()->GetFile doesn't work

Hello,

I’m trying to use gProof->GetManager()->GetFile in a (compiled) program after a PROOF session, and in that case GetFile hangs at 16 KB or so. It doesn’t work even when I close the session first and then call TProof::Mgr()->GetFile().

However, in the ROOT console TProof::Mgr()->GetFile() works just fine.

ROOT version: 5.26

Dear Eugeny,

Could you please provide the simplest code that reproduces the problem after compilation?
Please also give details about the way you compile it.

Gerri Ganis

As I just realized, it fails even in CINT.

This works fine:

bash-3.00$ rm /tmp/xrd 
bash-3.00$ root -l
root [0] TProof::Mgr("lxslc22")->GetFile("/afs/ihep.ac.cn/users/e/eugenyboger/root/bin/xrd","/tmp/")
[GetFile] Total 2.03 MB	|====================| 100.00 % [11.8 MB/s]
(Int_t)0
root [1] 

And this one does not:

bash-3.00$ rm /tmp/xrd 
bash-3.00$ root -l
root [0] p = TProof::Open("lxslc22")
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: OK (6 workers)                 
Setting up worker servers: OK (6 workers)                 
PROOF set to parallel mode (6 workers)
(class TProof*)0x84f9770
root [1] TProof::Mgr("lxslc22")->GetFile("/afs/ihep.ac.cn/users/e/eugenyboger/root/bin/xrd","/tmp/")
[GetFile] Total 2.03 MB	|>...................| 0.00 % [0.1 MB/s]

And then ROOT waits forever.

The size downloaded is almost constant:

bash-3.00$ ls -l /tmp/xrd 
-rw-------  1 eugenyboger software 32768 Feb 20 19:30 /tmp/xrd
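
For anyone wanting to check for this kind of stall from a program rather than with ls, here is a minimal sketch in plain C++ (no ROOT calls). The helper names fileSizeBytes and transferComplete are hypothetical, not part of ROOT, and the expected size is assumed to be known from elsewhere, for example from a listing on the remote side.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Returns the size of `path` in bytes, or -1 if the file cannot be opened.
std::int64_t fileSizeBytes(const std::string& path) {
    // Open at the end of the file so that tellg() reports its size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<std::int64_t>(in.tellg());
}

// Hypothetical helper: true only if the local copy has reached the
// expected size (the caller must supply that size; it is an assumption here).
bool transferComplete(const std::string& path, std::int64_t expectedBytes) {
    return fileSizeBytes(path) == expectedBytes;
}
```

Calling fileSizeBytes("/tmp/xrd") a few seconds apart and getting the same value below the expected size would indicate the same stall that the ls output above shows.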

update: a few minutes later it says

100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: -----------------------
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: TimeOut condition reached reading from remote server.
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: This may indicate that the server is a 'proofd', version <= 12
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: Retry commenting the 'Plugin.TSlave' line in system.rootrc or adding
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: Plugin.TSlave: ^xpd  TSlave Proof "TSlave(const char *,const char *,int,const char *, TProof *,ESlaveType,const char *,const char *)"
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: to your $HOME/.rootrc .
100220 21:40:45 19951 Proofx-I: Conn::DoHandShake: -----------------------
100220 21:40:45 19951 Proofx-E: Conn::GetAccessToSrv: handShake failed with server [lxslc.ihep.ac.cn:1093]
100220 21:40:45 19951 Proofx-E: Conn::Connect: access to server failed ()

Ok, for some reason it tries to open a second manager instance and it fails.

In the error message that you got after a few minutes the server lxslc.ihep.ac.cn is mentioned: is this a real address? I.e. the one you expect?

Can you do the following:

$ root -l
root[0] p = TProof::Open("lxslc22")
...
root[1] p->GetManager()
root[2] TProof::Mgr("lxslc22")

and report the result?

Also, can you try with the full host name (“lxslc22.ihep.ac.cn” or whatever)?

Gerri Ganis

It seems like the manager is the same:

root [0] p = TProof::Open("lxslc22.ihep.ac.cn")
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: OK (6 workers)
Setting up worker servers: OK (6 workers)
PROOF set to parallel mode (1 worker)
(class TProof*)0x84f9b00
root [1] p->GetManager()
(class TProofMgr*)0x844b5e0
root [2] p->GetManager()
(class TProofMgr*)0x844b5e0
root [3] TProof::Mgr("lxslc22.ihep.ac.cn")
(class TProofMgr*)0x844b5e0
root [4] TProof::Mgr("lxslc22.ihep.ac.cn")
(class TProofMgr*)0x844b5e0

You are right, “lxslc.ihep.ac.cn” is not a valid host name. I think the only place it could come from is the reverse DNS record for the IP address associated with lxslc22.

The other thing I’ve just noticed is the following:

root [4] p->GetManager()->GetFile("/afs/ihep.ac.cn/users/e/eugenyboger/root/bin/xrd","/tmp/")
Local file exists already: would you like to overwrite it? [N/y]y
[GetFile] Total 2.03 MB	|>...................Error: Symbol #include is not defined in current scope  (tmpfile):1:
Error: Symbol exception is not defined in current scope  (tmpfile):1:
Syntax Error: #include <exception> (tmpfile):1:
Error: Symbol G__exception is not defined in current scope  (tmpfile):1:
Error: type G__exception not defined FILE:(tmpfile) LINE:1
(Int_t)0
*** Interpreter error recovered ***
...
Root > .q
SysError in <TPosixCondition::~TPosixCondition>: pthread_cond_destroy error (No such file or directory)
SysError in <TPosixMutex::~TPosixMutex>: pthread_mutex_destroy error (No such file or directory)

This happens (instead of the hang described above) when the destination file already exists.

I also ran some tests on my laptop with a local PROOF session started: after

p = TProof::Open("localhost")

calls to GetFile eventually fail:

root [0] p = TProof::Open("localhost")
Starting master: opening connection ...
Starting master: OK                                                 
Opening connections to workers: OK (2 workers)                 
Setting up worker servers: OK (2 workers)                 
PROOF set to parallel mode (2 workers)
(class TProof*)0x84c7ef8
root [1] for (i=0;i<20; i++) TProof::Mgr("localhost")->GetFile("test.root","/tmp/","force")
Warning: Automatic variable i is allocated (tmpfile):1:
[GetFile] Total 0.10 MB	|====================| 100.00 % [14.1 MB/s]
[GetFile] Total 0.10 MB	|===>................Error: Symbol #include is not defined in current scope  (tmpfile):1:
Error: Symbol exception is not defined in current scope  (tmpfile):1:
Syntax Error: #include <exception> (tmpfile):1:
Error: Symbol G__exception is not defined in current scope  (tmpfile):1:
Error: type G__exception not defined FILE:(tmpfile) LINE:1
*** Interpreter error recovered ***
root [2] terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

So the problem doesn’t seem to be related to a particular cluster.

Hello, Gerri,

So, could you confirm that?

Hi Eugeny,

Unfortunately I cannot reproduce the problem on my local machine (Mac OS X).
I will try on other ones.

What happens if you run the PROOF test?

$ cd $ROOTSYS/test
$ make stressProof
$ ./stressProof

The ‘Admin’ test uses these methods.

Gerri

Sorry for the very late response.

The problem is still here: I have just seen it with the latest svn and with a different PROOF cluster (xrootd is from ROOT 5.26b). Sometimes it works, sometimes it does not:

Mst-0: grand total: sent 57 objects, size: 15541714 bytes       
[GetFile] Total 0.06 MB |===================>terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

upd: errors seem to be somehow related to the file size

The stressProof test shows no errors.

However, these two simple steps almost always hang:

p = TProof::Open("xrootd@lgdui01");
for (i=0;i<200; i++) p->GetManager()->GetFile("/data/pool/testfile.dd","/tmp/","force");

The output is like:

root [1] for (i=0;i<200; i++) p->GetManager()->GetFile("/data/pool/testfile.dd","/tmp/","force")
Warning: Automatic variable i is allocated (tmpfile):1:
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.8 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.9 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.9 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.9 MB/s]
[GetFile] Total 4.00 MB |====================| 100.00 % [10.9 MB/s]
[GetFile] Total 4.00 MB |==============>.....| 0.00 % [10.7 MB/s]

And the console hangs forever.
I don’t really know whether this is related to the problem discussed, but it definitely seems strange.