Re: parallel build does not work

From: Philippe Canal <pcanal_at_fnal.gov>
Date: Fri, 21 Aug 2009 12:16:05 -0500


Hi Tom

 > rm core/utils/src/RStl_tmp.cxx core/utils/src/rootcint_tmp.cxx

I think this might be a consequence rather than the cause of the problem (i.e. those are temporary files that gmake often deletes when the build fails).
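
Since the trailing rm line is probably only cleanup, a more useful clue is likely the first line where make itself reports an error, which in a parallel build can sit well above the end of the log. Below is a minimal sketch of how one might search a saved log for it. It assumes the usual GNU make error format ("make: *** ..."); the function name first_make_error and the sample log lines (including the example.cxx path) are made up for illustration and are not taken from a real ROOT build.

```python
import re

# GNU make reports a failed recipe with a line containing "***",
# e.g. "make: *** [target] Error 1" or "gmake[2]: *** ...".
MAKE_ERROR = re.compile(r"^g?make(\[\d+\])?: \*\*\* ")


def first_make_error(log_lines):
    """Return (index, line) of the first make error line, or None if absent."""
    for index, line in enumerate(log_lines):
        if MAKE_ERROR.match(line):
            return index, line.rstrip("\n")
    return None


# Hypothetical log excerpt, invented for illustration only.
sample_log = [
    "g++ -c core/utils/src/example.cxx",
    "make: *** [core/utils/src/example.o] Error 1",
    "make: *** Waiting for unfinished jobs....",
    "rm core/utils/src/RStl_tmp.cxx core/utils/src/rootcint_tmp.cxx",
]

if __name__ == "__main__":
    print(first_make_error(sample_log))
```

On a real log one would pass the lines of the saved file (for example, the output of "make -j 4" redirected to a file) instead of sample_log, then read the compiler or tool output just above the reported line.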

Cheers,
Philippe.

Tom Roberts wrote:
> Fons:
>
> I looked for console messages about disk errors; there are none. Yes,
> it is not at all obvious which command failed. But there is a strong
> hint at the end of each log, for instance:
> rm core/utils/src/RStl_tmp.cxx core/utils/src/rootcint_tmp.cxx
>
> I'm going to try this on Linux with 4 CPUs, and I'm going to try to
> get a log for a failure with "make -d -j 4" on both Mac OSX and Linux.
>
>
>
> Jörn:
>
> I believe your analysis is incorrect. A valid NFS server will serve a
> file that is in its cache, even if it has not yet been written to
> disk. And a valid NFS implementation won't consider a client-side file
> to be written until the server has acknowledged the transfer of data
> (i.e. ensuring the file is at least in the server's cache). So to me
> this looks like the same timing problem I'm trying to track: the build
> process should not try to read a just-built file until it has actually
> been built and written; I believe that the underlying problem is that
> somewhere it is not waiting correctly. There seems to be a subtle
> dependence on timing, such that Fons' system just happens to work, and
> mine and yours on NFS just happen to not work sometimes.
>
> I see what looks to me like a timing dependency, when I build as
> testuser: On my boot drive, "make -j 4" failed 2 out of 13 times (plus
> many other failures under other circumstances). On my Scratch drive, 0
> out of 16 failed. As I said before, in one case simply adding an empty
> directory to the front of PATH made failures much more likely.
>
> This is very difficult to debug: it is both subtle and a
> STATISTICAL thing (one success does not necessarily mean all is well,
> or even a dozen -- I've now had >30 successes as testuser, and yet my
> main-line build fails almost every time). I suggest an inspection of
> the makefiles regarding the construction of RStl_tmp.cxx and
> rootcint_tmp.cxx -- I'll bet some additional locking or re-ordering is
> needed to be truly parallel safe.
>
>
> Tom Roberts
>
>
> Joern Wuestenfeld wrote:
>> Hi Tom,
>>
>> one question: How is the disk you build ROOT on connected to your box?
>>
>> I have seen problems with parallel builds if your disk is connected
>> via NFS on a network with high latency.
>> It may then happen that the build process tries to read a file that
>> it has just built, but that the server has not yet written to disk.
>>
>> Regards,
>>
>> Jörn Wuestenfeld
>>
>>
>
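
The statistical point in the quoted message can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each build fails independently with a fixed probability; the thread does not establish this, and the timing dependence described above may well violate it. Using the 2-in-13 failure rate quoted for the boot drive, it estimates how often 16 builds in a row would all succeed by chance. The function name prob_all_succeed is made up for this example.

```python
def prob_all_succeed(failure_rate, runs):
    """Probability that `runs` independent builds all succeed,
    given a fixed per-build failure probability `failure_rate`."""
    return (1.0 - failure_rate) ** runs


# Failure rate taken from the figures quoted in the thread: 2 failures in 13 builds.
boot_drive_rate = 2 / 13

# Chance of seeing 16 clean builds in a row at that same rate
# (assumption: independent builds with a constant failure probability).
p_sixteen_clean = prob_all_succeed(boot_drive_rate, 16)

if __name__ == "__main__":
    print(f"P(16 clean builds in a row) = {p_sixteen_clean:.3f}")
```

Under these assumptions the result is about 7%, so a run of 16 clean builds is weak evidence on its own that a given setup is free of the race.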
Received on Fri Aug 21 2009 - 19:16:09 CEST
