Re: (retry) PROOF and I/O

From: Fons Rademakers <Fons.Rademakers_at_cern.ch>
Date: Thu, 8 Jul 2010 02:48:06 +0200


Hi Doug,

   it looks like you are doing most things right: turning on the TTree cache and reading only the branches you need. Now some performance numbers: ROOT can read about 20 MB/s of compressed data per core in C++. Hyper-threaded cores are not real cores (they are generally <50% effective), so your 16 cores amount to at most 10-12 effective ones, which would process 200 to 240 MB/s. If the NFS data comes in over a single GigE link you will get at most ~80 MB/s, and about the same from a single local 7200 rpm disk, which means only about 4 cores running at 100%. To use all cores optimally you will need a RAID-0 of several HDDs or SSDs, and several trunked GigE links or 10 GbE to your data servers (which must have a matching disk setup).
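
To see what your hardware can actually deliver, independent of the selector, you can time a bare read loop over only the branches you need. A minimal sketch (file, tree and branch names are just placeholders for your own):

   // io_speed.C -- measure the raw read rate of a tree, no analysis code.
   // Run with:  root -l -b -q io_speed.C
   #include "TFile.h"
   #include "TTree.h"
   #include "TStopwatch.h"
   #include <cstdio>

   void io_speed()
   {
      TFile *f = TFile::Open("mydata.root");      // local path or NFS mount
      TTree *t = (TTree *) f->Get("mytree");

      t->SetCacheSize(100 * 1024 * 1024);         // 100 MB TTree cache
      t->SetBranchStatus("*", 0);                 // disable all branches ...
      t->SetBranchStatus("pt",  1);               // ... then enable only what you need
      t->SetBranchStatus("eta", 1);
      t->AddBranchToCache("pt",  kTRUE);          // let the cache prefetch them
      t->AddBranchToCache("eta", kTRUE);

      TStopwatch sw;
      sw.Start();
      Long64_t bytes = 0, nent = t->GetEntries();
      for (Long64_t i = 0; i < nent; ++i)
         bytes += t->GetEntry(i);                 // returns uncompressed bytes read
      sw.Stop();

      double rt = sw.RealTime();
      printf("%lld entries, %.1f MB in %.1f s -> %.1f MB/s\n",
             nent, bytes / 1.0e6, rt, bytes / 1.0e6 / rt);
      delete f;
   }

If this bare loop is much faster than the ~2.5 MB/s the PROOF GUI reports, the bottleneck is in the selector rather than in the I/O.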

Using memcached will not help much, as streaming the files through it will still go at disk speed. Memcached is interesting when the entire data set can stay cached. The same can be achieved by putting a lot of RAM in the disk server/cluster and letting the OS buffer cache do the caching: the first time you query the dataset you go at disk speed, after that you read from RAM (and only the buffers actually read by ROOT end up in the buffer cache, not the entire files).

The 2.5 MB/s, however, points to the use of Python in TPySelector as the issue here, especially since you say the selector is quite complicated. Python is normally about an order of magnitude slower than C++. Could you rewrite your analysis in C++?
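
If you go that route, tree->MakeSelector("MySelector") will generate the selector skeleton for you, and the hot loop then runs compiled. A rough sketch of what the Process() method could look like (branch, histogram and cut values below are placeholders, not your actual analysis):

   // In MySelector.C, generated by tree->MakeSelector("MySelector").
   Bool_t MySelector::Process(Long64_t entry)
   {
      // Read only the branches the analysis needs.
      b_pt->GetEntry(entry);
      b_eta->GetEntry(entry);

      // Cuts and histogram filling now run as compiled C++.
      if (pt > 20.0 && TMath::Abs(eta) < 2.5)
         fHistPt->Fill(pt);                // fHistPt booked in SlaveBegin()

      return kTRUE;
   }

You would then run it with chain->Process("MySelector.C+") locally, or pass the same selector file to your TDSet::Process() call in PROOF, so the per-event work is compiled C++ instead of interpreted Python.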

Cheers, Fons.

On 08/07/2010 01:55, Doug Schouten wrote:
> Hi,
>
> I am writing some fairly complicated selectors (TPySelectors, actually) and
> I notice that, particularly when accessing data over NFS, the PROOF slaves
> quickly become I/O bound: I see many proofserv.exe processes sitting
> nearly idle. This also happens with data on local disk only (RAID-5, 7200
> rpm Seagate Barracudas ... so I can't improve things too much there).
>
> I have tried increasing the TTree cache size using t.SetCacheSize(), and I
> have also slimmed the ROOT files considerably and turned off, with
> SetBranchStatus(), all the branches that I don't need at run time.
>
> However, I still see relatively poor performance in terms of CPU usage. I
> have 16-core machines (albeit with hyper-threading) and I would like to
> utilize them better.
>
> So my question is two-fold:
>
> (1) Are there some methods/tips/tricks to improve performance? Are there
> caching parameters that I can set somewhere to prefetch files/trees in
> larger chunks? Currently I am processing my datasets at ~2.5 MB/s, as
> reported by the PROOF GUI, which is pretty slow IMHO. However, I think this
> is actually the rate at which data is being analyzed and not the rate at
> which I am reading through the files, which I guess are two very different
> things for large trees with many branches that I am not using. Am I right
> about this?
>
> (2) Anticipating that there are no easy solutions to (1), has anyone heard
> of memcached? This is a distributed memory cache which one can use to pool
> extra RAM from multiple machines. One can then use a FUSE filesystem,
> memcachefs, to store files in pooled memory. I am wondering how I could
> interface this with the TDSet infrastructure in PROOF. In particular, I
> imagine a FIFO buffer manager that pre-fetches files in a TDSet and kicks
> out already-processed ones, running in a separate thread/process somewhere
> on my cluster. Somehow, I would have to trick PROOF into not verifying the
> files before running the workers (because they would only 'arrive' in the
> cache just before they are needed), and I would need some way of
> communicating where I am in the TDSet list of files to the cache manager,
> so that I can grab the next N files and place them in the cache. Then, if
> the memory cache is large enough, or if I can copy files into it about as
> fast as I process them, hopefully I can lessen the I/O constraints, since
> reading from this cache will be limited only by network latency and the
> (apparently) very small CPU overhead of memcached.
>
> (Note: there is also a C++ API for memcached which can deal with arbitrary
> chunks of data, not restricted to whole files, but I imagine this would be
> even more low-level and complicated.)
>
> thanks,
> Doug
>

-- 
Org:    CERN, European Laboratory for Particle Physics.
Mail:   1211 Geneve 23, Switzerland
E-Mail: Fons.Rademakers_at_cern.ch              Phone: +41 22 7679248
WWW:    http://fons.rademakers.org           Fax:   +41 22 7669640