Parallelizing loops on TTree

Hi everybody!

I’m working on a piece of code in which I have 3 trees. I pick one event from the first tree (for instance) and perform some calculation between it and all the events in the second and third trees. Right now I do this sequentially, which means I loop over the second tree and then loop over the third tree. I want to run the calculation against the events of both trees at the same time. Is PROOF a good solution for that, or is it better to use TProcPool? Could you give me an example of how to do that?

Dear miqueias.ma,

You can probably do it with both, but it would probably be much easier with multiproc, because PROOF is designed to work on a single tree and independent entries.
However, multiproc is only available in the master head (and v6.06-rc) and works only on a multicore machine (i.e. not on a cluster). I don’t know if this is your case.

I’ll try to provide you examples for both.

G Ganis

Dear Ganis, thank you for the fast reply! Yes, that is my case: the code runs on a multicore machine. I looked at the ROOT page about TProcPool, but it is not straightforward to understand how this class works, so it would be good if you could give some example. And could you explain what you mean by “master head”? Is that about the ROOT version?

Dear miqueias.ma,

Attached you’ll find examples of how to do this with Proof-Lite (macro proof_twoTrees.C + selector SetTwoTrees.C,h) and with TProcPool (macro mp200_twoTrees.C). They use sample files from the ROOT site and just make a senseless addition to fill a histogram, but they should illustrate the concept.
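
For readers without the attachment, here is a minimal sketch of the TProcPool::ProcTree pattern the macros are built around (assuming the ROOT 6.06-era TProcPool interface; the file, tree and branch names are placeholders, not the contents of the attached macros):

// Minimal TProcPool sketch: one worker per file, each workItem loops over
// the entries of its tree with a TTreeReader and fills a histogram;
// the partial histograms are merged automatically by ProcTree.
#include "TProcPool.h"
#include "TTreeReader.h"
#include "TTreeReaderValue.h"
#include "TH1F.h"
#include <vector>
#include <string>

void mp_twoTrees_sketch()
{
   std::vector<std::string> files = {"tree1.root", "tree2.root"}; // placeholder file names

   auto workItem = [](TTreeReader &reader) -> TObject* {
      TTreeReaderValue<Double_t> x(reader, "x");     // placeholder branch name
      auto h = new TH1F("h", "h", 100, 0., 10.);
      while (reader.Next()) h->Fill(*x);             // "senseless" fill, just to illustrate
      return h;
   };

   TProcPool pool(2);                                 // two worker processes
   auto hsum = pool.ProcTree(files, workItem, "T");   // "T" is a placeholder tree name
   if (hsum) hsum->Draw();
}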

On my machine TProcPool is faster, but you need the master of the ROOT git repository or the forthcoming v6.06/00.

Let me know if you have any questions.

G Ganis
twoTrees.tgz (2.13 KB)

Dear Ganis,

thank you so much for your attention! I will look at the code you sent and comment further as soon as possible. I understood the point about the ROOT version and have already pulled the latest update of the package from the git repository!

Dear Ganis,

How can I use matrices stored in the trees? I tested the examples you sent, and for 1D arrays the code works if I use the class “TTreeReaderArray”. But I have to work with 3D arrays, and in this case I still could not find a way to implement it in the code. I searched for a class similar to TTreeReaderArray, but I did not find one. Do you know if it is possible to do that?

Hi,

How are the matrices stored in the file?

Philippe.

The matrices are 3D arrays of type Double_t. I mean… I have this variable “Double_t RecoParticle[6][3][2]”.
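
For a fixed-size branch of this shape, one approach that should work is plain TTree::SetBranchAddress with a buffer of the same dimensions. A minimal sketch (the file and tree names are placeholders; only the branch name comes from the post above):

#include "TFile.h"
#include "TTree.h"

void readMatrix_sketch()
{
   TFile f("ntuple.root");                                   // placeholder file name
   TTree *tree = (TTree *)f.Get("T");                        // placeholder tree name

   Double_t RecoParticle[6][3][2];
   // point the branch at the first element of the buffer
   tree->SetBranchAddress("RecoParticle", &RecoParticle[0][0][0]);

   for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
      tree->GetEntry(i);
      Double_t v = RecoParticle[0][0][0];                    // use the values here
      (void)v;
   }
}

Inside a TProcPool workItem, where you are handed a TTreeReader rather than the TTree itself, an alternative might be to read the branch as a flat TTreeReaderArray<Double_t> and index it by hand (element [i][j][k] at position i*6 + j*2 + k), but I have not verified that multi-dimensional branches are supported there in all versions.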

Dear Ganis,

I changed my variables to be closer to the example you gave me (“mp200_twoTrees.C”). I can run it without problems if I do some simple operation inside the loop. But I need to call a function that does more complex math, and when I do that the code fails with the message:

"pure virtual method called
terminate called without an active exception
pure virtual method called
terminate called without an active exception
[E][C] Lost connection to a worker
[E][C] Lost connection to a worker

*** Break *** segmentation violation"

I have attached the code to clarify what I need to do…
FastME_ProcPool2.C (7.82 KB)

Dear miqueias.ma,

In your code you are throwing an exception without handling it at an upper level. This will just make the worker exit without proper termination.
Can that exception actually occur?
Anyhow, I think you should just

return event_distance;

in ComputeDR_MinDist(…), and in workItem fill the histogram only if dr_test is > 0.
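
A minimal sketch of that pattern (the real inputs and body of ComputeDR_MinDist are in your macro; here they are reduced to a placeholder):

#include "Rtypes.h"   // Double_t
#include "TH1F.h"

Double_t ComputeDR_MinDist(Double_t input)
{
   Double_t event_distance = -1.;            // sentinel meaning "no valid result"
   if (input >= 0.) event_distance = input;  // placeholder for the real computation
   return event_distance;                    // return instead of throwing
}

// in workItem: fill only when the result is valid
void fillIfValid(TH1F *histo, Double_t input)
{
   Double_t dr_test = ComputeDR_MinDist(input);
   if (dr_test > 0) histo->Fill(dr_test);
}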
Can you try that?

G Ganis

Dear Ganis,

I followed your suggestion, but the same error still arises when I run the code… If I put the function’s body directly inside the loop, the code runs without problems and the plot confirms that everything is OK. Do you have another suggestion I can try?
Beyond that, I want to do something more complex than filling a histogram. I want to pick the smallest distance found when I compare the events of one tree to the other two trees, and then compare those smallest values to decide which tree has the event closest to the event in the main tree (just to recall: I have 3 trees and I want to compare each event of one of those trees, which could be data for instance, to the events in the other 2 trees, which could be MC for instance). Can you help me with that goal?

Dear miqueias.ma,

The complexity of the function should not really matter. I have tried to reproduce the problem with a similar structure in mp200_twoTrees.C, but I do not get failures.
I can offer to debug your case directly, but for that you would need to make your files, or at least subsamples of them, available.

I think the output of each workItem function should be a (small) TObject-derived class containing the result of the loop, e.g. the smallest value, the related event number, the file … for example:

class MyResult : public TObject {
public:
   Double_t fMin;      // smallest value
   Long64_t fEvtMin;   // event giving the min
   TString  fFileMin;  // file containing fEvtMin
   MyResult(Double_t m = -1., Long64_t e = -1, const char *f = 0) : fMin(m), fEvtMin(e), fFileMin(f) { }

   Long64_t Merge(TCollection *list);
};

When all workItems are finished, TProcPool::ProcTree will automatically call the Merge method on the first instance, passing a list of the other instances; it could be something like this:

Long64_t MyResult::Merge(TCollection *list)
{
   TIter nxres(list);
   MyResult *myr = 0;
   while ((myr = (MyResult *)nxres())) {
      // ignore results that were never filled (fMin still at the -1 default)
      if (myr->fMin >= 0. && (fMin < 0. || myr->fMin < fMin)) {
         fMin = myr->fMin;
         fEvtMin = myr->fEvtMin;
         fFileMin = myr->fFileMin;
      }
   }
   return list->GetEntries();   // a non-negative return value signals success
}

In the end, ProcTree will return the resulting MyResult instance.
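
A sketch of how this could be wired up, assuming the ROOT 6.06-era TProcPool interface (needs #include "TProcPool.h", "TTreeReader.h", "TFile.h", <vector>, <string>; file and tree names are placeholders and the per-entry distance is left as a dummy value):

auto workItem = [](TTreeReader &reader) -> TObject* {
   auto res = new MyResult();
   Long64_t entry = 0;
   while (reader.Next()) {
      Double_t d = 1.;   // placeholder: compute the distance for this entry here
      if (res->fMin < 0. || d < res->fMin) {
         res->fMin = d;
         res->fEvtMin = entry;
         TFile *cf = reader.GetTree() ? reader.GetTree()->GetCurrentFile() : 0;
         res->fFileMin = cf ? cf->GetName() : "";
      }
      ++entry;
   }
   return res;
};

TProcPool pool(2);
std::vector<std::string> files = {"mc1.root", "mc2.root"};         // placeholder file names
MyResult *best = (MyResult *)pool.ProcTree(files, workItem, "T");  // "T": placeholder tree name
// best now holds the overall smallest value, its entry number and its file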

G Ganis

Dear Ganis,

I have uploaded the code I have put together so far and some ROOT files I’m trying to use in this analysis. I apologize, but this is new to me and I’m still trying to understand some concepts of this parallelization. I said before that the code was OK… but that is not true. I did a recent test (printing to the screen the values computed inside the loops in workItem) and I saw that the code is actually not working as I want: it seems to look at just one event in the main ROOT file (what I called data). So, if possible, could you take a look and explain to me what is wrong? The code should loop over all events in the main ntuple, not only over those in the MC ntuples. You will see that I put the function’s code directly inside the loops, because I could not call it as a function: when I try to do that, the “Lost connection to a worker” error arises.
NewVBF_QED4_QCD0_SIG_FastME_2.root (1.66 MB)
NewVBF_QED4_QCD0_SIG_FastME_1.root (1.66 MB)
NewVBF_QED4_QCD0_BKG_FastME_2.root (1.67 MB)
FastME_ProcPool2.C (9.2 KB)

Dear Ganis,

Thank you for your attention! I solved the problem with reading the main tree. It is actually better to use a “for” loop instead of a “while” loop in this particular case; that way, one can control which entry is accessed using TTreeReader::SetEntry(). But there is still one problem: I compared the time spent by my original code (without TProcPool) and by this new code (the one I uploaded) to process the ntuples, and my original code took half the time that TProcPool needs to do the analysis. What could be the reason? I don’t think this is related to having the function’s code inside the loop…
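
For reference, a minimal sketch of the for-loop / TTreeReader::SetEntry() pattern described above (file, tree and branch names are placeholders):

#include "TFile.h"
#include "TTreeReader.h"
#include "TTreeReaderValue.h"

void loopWithSetEntry_sketch()
{
   TFile f("data.root");                        // placeholder file name
   TTreeReader reader("T", &f);                 // placeholder tree name
   TTreeReaderValue<Double_t> x(reader, "x");   // placeholder branch name

   Long64_t nData = reader.GetEntries(true);
   for (Long64_t i = 0; i < nData; ++i) {
      // jump to a specific entry instead of relying on Next()
      if (reader.SetEntry(i) != TTreeReader::kEntryValid) break;
      // ... use *x for entry i ...
   }
}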
I also noticed that reading the structure I’m using now is slower than reading the matrices I had before. If you know how I can use them in this code, I would appreciate it very much.
**The PC I’m working on has only 2 cores (could this also be a problem?)
FastME_ProcPool2.C (9.27 KB)

Dear miqueias.ma,

The performance issue is a bit strange; we saw no evidence of that in our preliminary evaluation of the tool.
Given the way the program is structured, and with two data files, ncores=2 gives the minimum processing time … more processes would step on each other’s feet when accessing the files.
How is the sequential program structured? Can you send it to me?
Anyhow, I am profiling what you sent to see if and where there are hot spots in the execution.

G Ganis

Dear Ganis,

Thank you so much for all this help. I’m not an expert, so this development is a bit hard for me. I also tested the code on lxplus (which has more cores). I think the slowdown may be a matter of the number of cores: since I have just 2 cores in my PC and I put both of them to work, there is no core left for the other processes related to the code (or to the rest of the machine), so the cores are not used efficiently.

The test on lxplus was very good: the same amount of data was processed in ~1.3 min, while my original code took ~4.5 min, and this time does not increase even if I use more MC samples. I also used a little trick to avoid the complexity of handling the result from each core separately: I simply used a 2D histogram to store the results, and at the end I can easily get the values back (it is actually a fast way to do what I need). Unfortunately, I was still not able to call my function from outside workItem, but I think the code now works the way I want, so I think I can mark the post as solved. If you manage to figure out how to put the function outside workItem, or can send an example, that would be very nice.
Thank you so much!!

Dear miqueias.ma,

I have slightly modified your macro to:

  1. have an option to run the sequential and parallel cases with the same code;
  2. use TStopwatch to measure the time; it is more practical (see the sketch below).
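
For illustration, the timing boils down to wrapping the analysis call in a TStopwatch (a sketch, not the full macro):

#include "TStopwatch.h"

TStopwatch sw;
sw.Start();
// ... run the sequential or the TProcPool analysis here ...
sw.Stop();
sw.Print();   // prints "Real time ... CP time ..." as in the output below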

With your settings, i.e. nData = 100, I get running times of the order of a few seconds and no speedup using multiproc:


root [1] FastME_ProcPool2n(2)
::: Starting Analysis with TProcPool :::
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
Real time 0:00:03, CP time 0.060
...
root [1] FastME_ProcPool2n(0)
::: Starting Analysis in sequential mode :::
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
Real time 0:00:03, CP time 3.550
root [2] .q

However, if I set nData=1000, I get the scaling I expect:

root [1] FastME_ProcPool2n(0)
::: Starting Analysis in sequential mode :::
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
Real time 0:00:33, CP time 33.050
...
root [1] FastME_ProcPool2n(2)
::: Starting Analysis with TProcPool :::
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
Real time 0:00:17, CP time 0.050

That is, there is some overhead in setting up the multiproc machinery, which may be dominant for jobs of a few seconds.

I will analyse the profiling results and see if there are any obvious improvements.

G Ganis
FastME_ProcPool2n.C (9.99 KB)

Thank you again! I like TStopwatch, it is much better… I didn’t know about it. Did you try to move the function (ComputeDR_MinDist) outside of workItem? I tried calling it from inside workItem in many ways, but I could not get it to work. I also tried simpler functions, but it seems that it is not possible to use functions there; I had to put the code explicitly inside workItem.