Re: parallel build does not work

From: Tom Roberts <tjrob_at_fnal.gov>
Date: Fri, 21 Aug 2009 11:36:02 -0500


Fons:

I looked for console messages about disk errors; there are none. Yes, it is not at all obvious which command failed. But there is a strong hint at the end of each log, for instance:

    rm core/utils/src/RStl_tmp.cxx core/utils/src/rootcint_tmp.cxx

I'm going to try this on Linux with 4 CPUs, and I'm going to try to get a log for a failure with "make -d -j 4" on both Mac OS X and Linux.

Jörn:

I believe your analysis is incorrect. A valid NFS server will serve a file that is in its cache, even if it has not yet been written to disk. And a valid NFS implementation won't consider a client-side file to be written until the server has acknowledged the transfer of data (i.e. until the file is at least in the server's cache). So to me this looks like the same timing problem I'm trying to track: the build process should not try to read a just-built file until it has actually been built and written. I believe the underlying problem is that somewhere the build is not waiting correctly. There seems to be a subtle dependence on timing, such that Fons' system just happens to work, while mine and yours on NFS sometimes just happen not to.

I see what looks to me like a timing dependency when I build as testuser: on my boot drive, "make -j 4" failed 2 out of 13 times (plus many other failures under other circumstances). On my Scratch drive, 0 out of 16 failed. As I said before, in one case simply adding an empty directory to the front of PATH made failures much more likely.
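
A tally like the one above can be automated. The following is only a minimal sketch, not part of the ROOT build: the function name count_failures is hypothetical, and it assumes the command passed in restores a clean tree before each build (for example a wrapper script of your own), since otherwise a second run has nothing left to rebuild.

```python
import subprocess
import sys


def count_failures(cmd, runs, cwd=None):
    """Run cmd `runs` times; return (number of failures, output of each failed run)."""
    failures = 0
    logs = []
    for _ in range(runs):
        # Merge stderr into stdout so the tail of a failed log, where the
        # hint about the failing step tends to be, is kept in one place.
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        if result.returncode != 0:
            failures += 1
            logs.append(result.stdout)
    return failures, logs


# Hypothetical usage; "./clean_and_build.sh" is an assumed wrapper that
# cleans the tree and then runs "make -j 4":
#   failed, logs = count_failures(["./clean_and_build.sh"], runs=13)
#   print(failed, "of 13 builds failed")
```

Keeping the output of only the failed runs makes it easier to compare the last lines of each failure.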

This is very difficult to debug: it is both subtle and a STATISTICAL thing (one success does not necessarily mean all is well, or even a dozen; I've now had >30 successes as testuser, and yet my main-line build fails almost every time). I suggest an inspection of the makefiles regarding the construction of RStl_tmp.cxx and rootcint_tmp.cxx. I'll bet some additional locking or re-ordering is needed to be truly parallel safe.
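
To put a number on why a run of successes proves little, here is a rough sketch. It assumes each build fails independently with a fixed probability, and it uses 2/13 purely as an illustrative rate taken from the boot-drive count; neither assumption is established. The function names are hypothetical.

```python
import math


def prob_all_succeed(p_fail, runs):
    """Probability that `runs` independent builds all succeed."""
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest number of builds after which at least one failure would
    have been seen with probability >= confidence (0 < p_fail < 1)."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Assumed failure rate of 2/13 per build (illustrative only).
print(prob_all_succeed(2 / 13, 16))
print(runs_needed(2 / 13, 0.95))
```

Under those assumptions, 16 clean builds in a row would still happen about 7% of the time even if the bug were present, and it would take 18 consecutive clean builds to be 95% sure of seeing at least one failure. A rarer failure needs correspondingly more runs.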

Tom Roberts

Joern Wuestenfeld wrote:
> Hi Tom,
>
> one question: How is the disk you build ROOT on connected to your box?
>
> I have seen problems with parallel builds if your disk is connected via
> NFS on a network with high latency.
> It may happen then that the build process tries to read a file that
> it has just built, but the server has not yet written it to disk.
>
> Regards,
>
> Jörn Wuestenfeld
>
>
Received on Fri Aug 21 2009 - 18:36:06 CEST
