Re: tcmalloc-related hangs

From: <wlavrijsen_at_lbl.gov>
Date: Wed, 8 Sep 2010 11:09:48 -0700


Charles,

[.. added hn-atlas-SWArchitecture_at_cern.ch to the CC: ..]

> How can this hangup be prevented?

Not using tcmalloc would be the obvious one, but apparently there have been several fixes to this particular code, so moving to a newer tcmalloc may help as well. (As usual with deadlocks, the authors have a hard time reproducing each and every case, so a code change may or may not be an actual fix.) Note that we're still running tcmalloc 0.99.2b in 15.6.10, AFAICS.

Another option that I found on the interwebs is setting the envar TCMALLOC_MAX_FREE_QUEUE_SIZE to 0. Doing so should prevent tcmalloc from keeping its own internal structures for managing free memory. However, at that point it may be just as well to run w/o tcmalloc (--stdcmalloc as option to athena), given that any speed-up will likely be gone.
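
For what it's worth, the two options above could be wired into a small job wrapper along these lines. This is only a sketch: the helper names (tcmalloc_env, athena_command) and the job options file name are made up for illustration, and I have not verified that tcmalloc 0.99.2b actually honours TCMALLOC_MAX_FREE_QUEUE_SIZE.

```python
import os
import subprocess
import sys


def tcmalloc_env(base_env=None):
    """Return a copy of the environment with the tcmalloc free queue set to 0.

    Assumption: the tcmalloc build in use reads TCMALLOC_MAX_FREE_QUEUE_SIZE;
    this is not confirmed for 0.99.2b.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TCMALLOC_MAX_FREE_QUEUE_SIZE"] = "0"
    return env


def athena_command(job_options, use_std_malloc=False):
    """Build the athena command line, optionally bypassing tcmalloc."""
    cmd = ["athena"]
    if use_std_malloc:
        # Option named in this mail: run with the standard C malloc instead.
        cmd.append("--stdcmalloc")
    cmd.append(job_options)
    return cmd


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python run_job.py myJobOptions.py
    # Keeps tcmalloc loaded but asks it not to queue freed memory.
    sys.exit(subprocess.call(athena_command(sys.argv[1]), env=tcmalloc_env()))
```

To drop tcmalloc altogether, call athena_command(job_options, use_std_malloc=True) and pass the unmodified environment.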

Best regards,

            Wim

--
WLavrijsen_at_lbl.gov    --    +1 (510) 486 6411    --    www.lavrijsen.net

On Wed, 8 Sep 2010, Charles G Waldman wrote:

> Example job:
>
> http://panda.cern.ch:25980/server/pandamon/query?job=1110280594
>
> 6 hour wall time, stalled after 50 minutes of CPU time
>
> resources_used.walltime = 05:53:30
> resources_used.cput = 00:51:42
> resources_used.mem = 4000284kb
> resources_used.vmem = 4850496kb
>
> The tmp.stderr.b8bf4590-968c-45b7-b5c9-bbe8496ac53e file shows a failing
> memory allocation (not surprising since we're at or near the 4GB limit)
>
> [64] Using existing buffer to read 2016 bytes.
> src/tcmalloc.cc:1902] allocation failed: 12
> [64] SEEK_SET inside Read-ahead buffer. Expected position 1348720729
> [64] Using existing buffer to read 2016 bytes.
>
> However, the tcmalloc failure is resulting in a hang (SpinLock) rather than
> program termination as seen in the stack trace below:
>
> How can this hangup be prevented?
>
>
> #0 0xffffe425 in __kernel_vsyscall ()
> #1 0x00ca5ca0 in __nanosleep_nocancel () from /lib/libpthread.so.0
> #2 0xf7745991 in SpinLock::SlowLock (this=0xf7761400)
> at src/base/spinlock.cc:99
> #3 0xf7742955 in TCMalloc_Central_FreeList::RemoveRange (this=0xf7761400,
> start=0xffd663c8, end=0xffd663c4, N=7) at src/base/spinlock.h:80
> #4 0xf7742abe in TCMalloc_ThreadCache::FetchFromCentralCache (this=0x8172000,
> cl=55, byte_size=8704) at src/tcmalloc.cc:2020
> #5 0xf7747608 in operator new[] (size=8504) at src/tcmalloc.cc:1989
> #6 0xf56fd4c8 in TBasket::ReadBasketBuffers ()
> from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
> #7 0xf5705a82 in TBranch::GetBasket ()
> from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
> #8 0xf570629d in TBranch::GetEntry ()
> from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
> #9 0xf24365f7 in TConvertingBranchElement::GetEntry ()
> from /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
> #10 0xf2435e52 in TConvertingBranchElement::ReadSubBranches ()
> from /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
Received on Wed Sep 08 2010 - 20:10:11 CEST
