Re: Re: tcmalloc-related hangs

From: Philippe Canal <pcanal_at_fnal.gov>
Date: Fri, 10 Sep 2010 15:44:43 -0500


  Hi,

I am guessing that the exact allocation that provokes the problem is likely to be 'random'. However, the allocation that provoked the problem this time:

#5 0xf7747608 in operator new[] (size=8504) at src/tcmalloc.cc:1989
#6 0xf56fd4c8 in TBasket::ReadBasketBuffers ()
   from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
#7 0xf5705a82 in TBranch::GetBasket ()
   from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so

has now been factored out (i.e. the allocations are now done once per branch rather than once per basket) by revisions 35225, 35226 and 35231.
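
To illustrate the idea of allocating once per branch rather than once per basket, here is a minimal sketch. It is not the actual ROOT code from those revisions; the names BranchSketch, BasketSketch, ScratchFor and CountAllocations are hypothetical, and the sketch assumes the buffer only needs to grow when a basket is larger than any seen before on that branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a basket: only the payload size matters here.
struct BasketSketch {
    std::size_t nbytes;
};

// Hypothetical stand-in for a branch that owns a single scratch buffer and
// reuses it for every basket it reads.
class BranchSketch {
public:
    // Returns a scratch area of at least basket.nbytes bytes.
    // Memory is requested only when the current buffer is too small.
    char* ScratchFor(const BasketSketch& basket) {
        if (basket.nbytes > fScratch.size()) {
            fScratch.resize(basket.nbytes);
            ++fAllocations;
        }
        return fScratch.data();
    }

    // Number of times the buffer had to grow.
    std::size_t Allocations() const { return fAllocations; }

private:
    std::vector<char> fScratch;
    std::size_t fAllocations = 0;
};

// Reads `count` baskets of `nbytes` bytes (nbytes > 0) through one branch
// and reports how many times the allocator was actually called.
std::size_t CountAllocations(std::size_t count, std::size_t nbytes) {
    BranchSketch branch;
    for (std::size_t i = 0; i < count; ++i) {
        char* buf = branch.ScratchFor(BasketSketch{nbytes});
        assert(buf != nullptr);
        buf[0] = 0;  // stand-in for filling the buffer from the file
    }
    return branch.Allocations();
}
```

With this pattern, reading many baskets of the size seen in the trace (8504 bytes) goes through the allocator once per branch instead of once per basket, so the central free list is hit far less often.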

Cheers,
Philippe.

On 9/8/10 1:09 PM, wlavrijsen_at_lbl.gov wrote:
> Charles,
>
> [.. added hn-atlas-SWArchitecture_at_cern.ch to the CC: ..]
>
>> How can this hangup be prevented?
>
> not using tcmalloc would be the obvious one, but apparently there have been
> several fixes to the particular code here, so moving to a newer tcmalloc may
> help as well (as usual with deadlocks, the authors have a hard time reproducing
> each and every case, so code being fixed may or may not be an actual fix). Note
> that we're still running tcmalloc 0.99.2b in 15.6.10 AFAICS.
>
> Another option that I found on the interwebs is setting the envar
> TCMALLOC_MAX_FREE_QUEUE_SIZE to 0. Doing so should prevent tcmalloc from keeping
> its own internal structures for managing free memory. However, at that point
> it may be just as well to run w/o tcmalloc (--stdcmalloc as option to athena),
> given that any speed-up will likely be gone.
>
> Best regards,
> Wim
> --
> WLavrijsen_at_lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
>
> On Wed, 8 Sep 2010, Charles G Waldman wrote:
>> Example job:
>>
>> http://panda.cern.ch:25980/server/pandamon/query?job=1110280594
>>
>> 6 hour wall time, stalled after 50 minutes of CPU time
>>
>> resources_used.walltime = 05:53:30
>> resources_used.cput = 00:51:42
>> resources_used.mem = 4000284kb
>> resources_used.vmem = 4850496kb
>>
>> The tmp.stderr.b8bf4590-968c-45b7-b5c9-bbe8496ac53e file shows a failing
>> memory allocation (not surprising since we're at or near the 4GB limit)
>>
>> [64] Using existing buffer to read 2016 bytes.
>> src/tcmalloc.cc:1902] allocation failed: 12
>> [64] SEEK_SET inside Read-ahead buffer. Expected position 1348720729
>> [64] Using existing buffer to read 2016 bytes.
>>
>> However, the tcmalloc failure is resulting in a hang (SpinLock) rather than
>> program termination as seen in the stack trace below:
>>
>> How can this hangup be prevented?
>>
>>
>> #0 0xffffe425 in __kernel_vsyscall ()
>> #1 0x00ca5ca0 in __nanosleep_nocancel () from /lib/libpthread.so.0
>> #2 0xf7745991 in SpinLock::SlowLock (this=0xf7761400)
>>    at src/base/spinlock.cc:99
>> #3 0xf7742955 in TCMalloc_Central_FreeList::RemoveRange (this=0xf7761400,
>>    start=0xffd663c8, end=0xffd663c4, N=7) at src/base/spinlock.h:80
>> #4 0xf7742abe in TCMalloc_ThreadCache::FetchFromCentralCache (this=0x8172000,
>>    cl=55, byte_size=8704) at src/tcmalloc.cc:2020
>> #5 0xf7747608 in operator new[] (size=8504) at src/tcmalloc.cc:1989
>> #6 0xf56fd4c8 in TBasket::ReadBasketBuffers ()
>>    from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>> #7 0xf5705a82 in TBranch::GetBasket ()
>>    from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>> #8 0xf570629d in TBranch::GetEntry ()
>>    from /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>> #9 0xf24365f7 in TConvertingBranchElement::GetEntry ()
>>    from /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
>> #10 0xf2435e52 in TConvertingBranchElement::ReadSubBranches ()
>>    from /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
>
Received on Fri Sep 10 2010 - 22:44:58 CEST