Re: Re: tcmalloc-related hangs

From: Paul Russo <russo_at_fnal.gov>
Date: Fri, 10 Sep 2010 16:14:35 -0500


> has not been factored out (i.e. the allocation are now once per branch

     /
  now

I'm pretty sure Philippe meant "now" here, not "not".

> rather than once per basket) by revision 35225,35226 and 35231.
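
To make the "once per branch rather than once per basket" point concrete, here is a minimal sketch of the idea. It is not ROOT's actual code: the function names and the basket sizes are made up for illustration, and the real change lives in the revisions quoted above. The point is only that a buffer owned by the branch is reused across baskets and reallocated only when a larger basket shows up.

```python
def count_allocations_per_basket(basket_sizes):
    """Old pattern (illustrative): a fresh buffer for every basket read."""
    allocations = 0
    for size in basket_sizes:
        buffer = bytearray(size)  # new allocation on every basket
        allocations += 1
    return allocations


def count_allocations_per_branch(basket_sizes):
    """New pattern (illustrative): one buffer kept by the branch and reused."""
    allocations = 0
    buffer = bytearray(0)
    for size in basket_sizes:
        if size > len(buffer):
            # Grow only when the current buffer is too small.
            buffer = bytearray(size)
            allocations += 1
        view = memoryview(buffer)[:size]  # the basket is read into this slice
    return allocations


# Hypothetical basket sizes; 8504 is the size seen in frame #5 of the trace.
sizes = [8504, 8504, 4096, 8504]
print(count_allocations_per_basket(sizes))  # 4
print(count_allocations_per_branch(sizes))  # 1
```

Fewer calls into the allocator means fewer chances to land in the failing path, though it does not by itself address the hang.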

On 9/10/2010 3:44 PM, Philippe Canal wrote:

>  Hi,
> 
> I am guessing that the exact allocation that provokes the problem is likely
> to be 'random', however the allocation that provokes the problem that time:
> 
> #5  0xf7747608 in operator new[] (size=8504) at src/tcmalloc.cc:1989
> #6  0xf56fd4c8 in TBasket::ReadBasketBuffers ()
>   from
> /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
> 
> #7  0xf5705a82 in TBranch::GetBasket ()
>   from
> /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
> 
> 
> has not been factored out (i.e. the allocation are now once per branch
> rather than once per basket) by revision 35225,35226 and 35231.
> 
> Cheers,
> Philippe.
> 
> On 9/8/10 1:09 PM, wlavrijsen_at_lbl.gov wrote:

>> Charles,
>>
>> [.. added hn-atlas-SWArchitecture_at_cern.ch to the CC: ..]
>>
>>> How can this hangup be prevented?
>>
>> not using tcmalloc would be the obvious one, but apparently there have been
>> several fixes to the particular code here, so moving to a newer tcmalloc may
>> help as well (as usual with deadlocks, the authors have a hard time reproducing
>> each and every case, so code being fixed may or may not be an actual fix). Note
>> that we're still running tcmalloc 0.99.2b in 15.6.10 AFAICS.
>>
>> Another option that I found on the interwebs is setting the envar
>> TCMALLOC_MAX_FREE_QUEUE_SIZE to 0. Doing so should prevent tcmalloc from
>> keeping its own internal structures for managing free memory. However, at that
>> point it may be just as well to run w/o tcmalloc (--stdcmalloc as option to
>> athena), given that any speed-up will likely be gone.
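
For anyone who wants to try either workaround from a wrapper script, a rough sketch follows. The variable name TCMALLOC_MAX_FREE_QUEUE_SIZE and the --stdcmalloc option are taken from the message above as-is and have not been checked against the tcmalloc 0.99.2b sources; the job options file name is a placeholder, and the helper function names are made up here.

```python
import os


def tcmalloc_workaround_env(base_env=None):
    """Return a copy of the environment with the suggested variable set to 0.

    The variable name comes from the message above; whether this tcmalloc
    version honours it is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TCMALLOC_MAX_FREE_QUEUE_SIZE"] = "0"
    return env


def athena_without_tcmalloc(job_options):
    """Build the command line for running athena with the standard malloc."""
    return ["athena", "--stdcmalloc", job_options]


# "myJobOptions.py" is a placeholder file name.
env = tcmalloc_workaround_env({"PATH": "/usr/bin"})
cmd = athena_without_tcmalloc("myJobOptions.py")
print(env["TCMALLOC_MAX_FREE_QUEUE_SIZE"])  # 0
print(" ".join(cmd))  # athena --stdcmalloc myJobOptions.py
```

Either result can be handed to subprocess.run (the environment via its env argument); the two workarounds are alternatives, not meant to be combined.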
>>
>> Best regards,
>> Wim
>> --
>> WLavrijsen_at_lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net
>>
>> On Wed, 8 Sep 2010, Charles G Waldman wrote:
>>> Example job:
>>>
>>> http://panda.cern.ch:25980/server/pandamon/query?job=1110280594
>>>
>>> 6 hour wall time, stalled after 50 minutes of CPU time
>>>
>>> resources_used.walltime = 05:53:30
>>> resources_used.cput = 00:51:42
>>> resources_used.mem = 4000284kb
>>> resources_used.vmem = 4850496kb
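
As a side note, stalls like this one can be spotted from the batch accounting alone by comparing CPU time with wall time. The sketch below parses the two resources_used lines quoted above; the 0.5 threshold and the function names are arbitrary choices for illustration, not anything the batch system or Panda provides.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_resources(report):
    """Parse 'key = value' lines into a dictionary."""
    values = {}
    for line in report.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def cpu_efficiency(report):
    """Return CPU time divided by wall time for a resources_used report."""
    values = parse_resources(report)
    wall = hms_to_seconds(values["resources_used.walltime"])
    cpu = hms_to_seconds(values["resources_used.cput"])
    return cpu / wall


def looks_stalled(report, threshold=0.5):
    """Flag jobs whose CPU/wall ratio is below an (arbitrary) threshold."""
    return cpu_efficiency(report) < threshold


REPORT = """resources_used.walltime = 05:53:30
resources_used.cput = 00:51:42
"""

print(round(cpu_efficiency(REPORT), 3))  # 0.146
print(looks_stalled(REPORT))  # True
```

A low ratio only says the job was mostly idle; it takes the stack trace to tell a spinlock hang from, say, slow I/O.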
>>>
>>> The tmp.stderr.b8bf4590-968c-45b7-b5c9-bbe8496ac53e file shows a failing
>>> memory allocation (not surprising since we're at or near the 4GB limit)
>>>
>>> [64] Using existing buffer to read 2016 bytes.
>>> src/tcmalloc.cc:1902] allocation failed: 12
>>> [64] SEEK_SET inside Read-ahead buffer. Expected position 1348720729
>>> [64] Using existing buffer to read 2016 bytes.
>>>
>>> However, the tcmalloc failure is resulting in a hang (SpinLock) rather than
>>> program termination as seen in the stack trace below:
>>>
>>> How can this hangup be prevented?
>>>
>>>
>>> #0 0xffffe425 in __kernel_vsyscall ()
>>> #1 0x00ca5ca0 in __nanosleep_nocancel () from /lib/libpthread.so.0
>>> #2 0xf7745991 in SpinLock::SlowLock (this=0xf7761400)
>>>    at src/base/spinlock.cc:99
>>> #3 0xf7742955 in TCMalloc_Central_FreeList::RemoveRange (this=0xf7761400,
>>>    start=0xffd663c8, end=0xffd663c4, N=7) at src/base/spinlock.h:80
>>> #4 0xf7742abe in TCMalloc_ThreadCache::FetchFromCentralCache (this=0x8172000,
>>>    cl=55, byte_size=8704) at src/tcmalloc.cc:2020
>>> #5 0xf7747608 in operator new[] (size=8504) at src/tcmalloc.cc:1989
>>> #6 0xf56fd4c8 in TBasket::ReadBasketBuffers ()
>>> from
>>> /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>>>
>>> #7 0xf5705a82 in TBranch::GetBasket ()
>>> from
>>> /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>>>
>>> #8 0xf570629d in TBranch::GetEntry ()
>>> from
>>> /share/osg/app/atlas_app/atlas_rel/15.6.10/DetCommon/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libTree.so
>>>
>>> #9 0xf24365f7 in TConvertingBranchElement::GetEntry ()
>>> from
>>> /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
>>>
>>> #10 0xf2435e52 in TConvertingBranchElement::ReadSubBranches ()
>>> from
>>> /share/osg/app/atlas_app/atlas_rel/15.6.10/AtlasCore/15.6.10/InstallArea/i686-slc5-gcc43-opt/lib/libRootConversions.so
>>>
>>

> Received on Fri Sep 10 2010 - 23:14:42 CEST
