Re: Manage a ROOT file from a streaming of the ROOT file content

From: Fons Rademakers <Fons.Rademakers_at_cern.ch>
Date: Thu, 19 Apr 2012 14:15:38 +0200


Hi Hassen,

    this seems like a lot of work to reproduce the PROOF functionality we already have in ROOT. PROOF is basically ROOT's version of MapReduce, employing many of the same techniques plus additional ones like live feedback and interactive operation.

HDFS is written for the efficient streaming of flat log files which can be trivially cat'ed, while ROOT Tree files are complex and optimized for selective (vertical) access to speed up complex data-mining queries.

HDFS supports seeking in read-only files, so our HDFS plugin works fine, but we don't especially benefit from anything HDFS might offer. A bunch of ROOT files spread over an xrootd data cluster and chained together will work just as efficiently as HDFS in that respect.
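
For example, chaining files served by an xrootd cluster is straightforward. A minimal sketch (the redirector host, file paths and tree name are just placeholders):

   // chainExample.C -- chain ROOT files spread over an xrootd cluster
   #include "TChain.h"
   #include <iostream>

   void chainExample()
   {
      TChain chain("MyTree");                               // tree name inside each file
      chain.Add("root://xrd.example.org//data/run1.root");  // located via the redirector
      chain.Add("root://xrd.example.org//data/run2.root");
      std::cout << "Total entries: " << chain.GetEntries() << std::endl;
   }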

Cheers, Fons.

On 19/04/2012 12:42, Hassen Riahi wrote:
> Hi Charles and Philippe,
>
>> To add to Philippe's point, in Hadoop MapReduce,
>
> The answer to the related question asked by Philippe is that we are trying
> to use the MapReduce framework, which passes the input data to the map via a pipe.
>
>> you may pass the meta-information for the file to a streaming or pipes
>> C++ job, then have it operate on the file directly.
>
> OK, there are two ways to do this:
>
> I- Give as input a txt file containing the paths of the ROOT files. In this
> way, however, we will not benefit from the Hadoop data-locality optimization,
> since the jobs will execute where the txt files reside and not where the ROOT files do.
> II- Give the ROOT files themselves as input. That is what we are trying to achieve
> (we are trying to make this solution work as optimally as possible).
>> That is,
>> 1. Assume that there is some Java or C++ MapReduce driver -- your map
>> method -- that is mapping across a collection of root files.
>> 2. You may obtain the path of an input file within the map task, and
>> assuming that this is stored in HDFS,
>
> It is possible with both (I) and (II).
> In (I), it is easy to achieve, since the input stream of the txt file is
> exactly the paths of the ROOT files. The con, as said above, is that the job
> will not benefit from the Hadoop data-locality optimization.
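>
> A minimal sketch of what the map in (I) could look like (the tree name
> "MyTree" and the per-file entry count are just placeholders for the real
> analysis):
>
>    // streaming mapper for (I): each stdin line is the path of one ROOT file
>    #include <iostream>
>    #include <string>
>    #include "TFile.h"
>    #include "TTree.h"
>
>    int main()
>    {
>       std::string path;
>       while (std::getline(std::cin, path)) {      // one ROOT file path per line
>          TFile *f = TFile::Open(path.c_str());    // local path, root:// or hdfs:// URL
>          if (!f || f->IsZombie()) continue;
>          TTree *t = (TTree*)f->Get("MyTree");
>          if (t) std::cout << path << "\t" << t->GetEntries() << std::endl;  // key/value pair
>          delete f;
>       }
>       return 0;
>    }
>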
> In (II), the binary stream of the ROOT file will not be used by the map. The
> map (a C++ ROOT application) somehow gets the paths of the ROOT files, then
> opens and reads them. The con of this solution is that the ROOT files are
> opened and read twice: the first time by the MapReduce framework and the
> second time by the map.
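>
> One way the map could obtain the path (a sketch, assuming the streaming
> framework exports job properties as environment variables; the exact variable
> name depends on the Hadoop version):
>
>    // sketch: recover the current input file path inside a streaming map task
>    #include <cstdlib>
>    #include <iostream>
>    #include "TFile.h"
>
>    int main()
>    {
>       // Hadoop streaming exports job properties with '.' turned into '_':
>       // older releases set map_input_file, newer ones mapreduce_map_input_file.
>       const char *path = std::getenv("map_input_file");
>       if (!path) path = std::getenv("mapreduce_map_input_file");
>       if (!path) { std::cerr << "no input file path in environment\n"; return 1; }
>
>       TFile *f = TFile::Open(path);    // open the ROOT file directly, not via the pipe
>       if (!f || f->IsZombie()) return 1;
>       f->ls();                         // placeholder for the real analysis
>       delete f;
>       return 0;
>    }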
>
>> use the hdfs package to copy the file to a temporary work directory;
>> job.local.dir is, I believe, the default scratch location.
>
> So it requires transferring the input file to the worker node. I would, if
> possible, avoid this, since it can imply a lot of traffic inside the cluster.
>
>> 3. You would then pass this path to your C++ ROOT application (the
>> "worker") -- it could be pipes or streaming -- and the worker can then
>> write back to HDFS itself, or pass back a path that the mapper could then
>> use to write data back to HDFS or pass on to a reduce phase.
>> If you would like, I can endeavor to clean up some very ugly and
>> uncommented code on bitbucket and pass that link to you.
>
> Yes, sure, that would be great! I will scale-test it and keep you informed
> about the results.
>
> cheers
> Hassen
>
>> C
>> On Apr 18, 2012, at 10:15 AM, Philippe Canal wrote:
>>
>>> Hi Hassen,
>>>
>>> > it seems that this syntax hdfs://adminNode/user/hassen/file.root did
>>> not work for us since we are using an old version of ROOT.
>>>
>>> Yes, the HDFS plugin was introduced only in v5.26 (and may need to be
>>> explicitly requested when running the configure command).
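>>>
>>> For reference, once you have v5.26 or later with the HDFS plugin enabled,
>>> reading is then just (a sketch; read-only, at the ROOT prompt):
>>>
>>>   root [0] TFile *f = TFile::Open("hdfs://adminNode/user/hassen/file.root")
>>>   root [1] f->ls()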
>>>
>>> > By the way, if I understand correctly, the HDFS plugin of ROOT just allows
>>> > bypassing FUSE when reading from HDFS
>>>
>>> Yes.
>>>
>>> > and does not allow reading a ROOT file as an input stream. Is that right?
>>>
>>> Yes, the current implementation of TFile and TTree *requires* random
>>> access to the file and thus cannot read directly from an input stream.
>>>
>>> > Since I expect that the usage of TMemFile in production will require
>>> an unreasonable amount of RAM.
>>>
>>> A related question, though: why do you really need to pass the data via
>>> the pipe? In a similar environment (PROOF), rather than passing the data
>>> from the controller to the controllee, we pass meta-information
>>> (filename, treename, entry range) and the controllee/worker then accesses
>>> the file directly and efficiently.
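>>>
>>> A sketch of that worker side (all names are placeholders; the point is that
>>> only the meta-information travels, not the data):
>>>
>>>    // worker sketch: open the file by name and process only its entry range
>>>    #include "TFile.h"
>>>    #include "TTree.h"
>>>    #include "TMath.h"
>>>
>>>    void process(const char *filename, const char *treename,
>>>                 Long64_t first, Long64_t nentries)
>>>    {
>>>       TFile *f = TFile::Open(filename);      // direct, seekable access to the file
>>>       TTree *t = f ? (TTree*)f->Get(treename) : 0;
>>>       if (!t) { delete f; return; }
>>>       Long64_t last = TMath::Min(first + nentries, t->GetEntries());
>>>       for (Long64_t i = first; i < last; ++i) {
>>>          t->GetEntry(i);                     // only the needed baskets are read
>>>          // ... analysis of this entry ...
>>>       }
>>>       delete f;
>>>    }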
>>>
>>> Cheers,
>>> Philippe.
>>>
>>> On 4/18/12 6:21 AM, Hassen Riahi wrote:
>>>>
>>>>> Hi Fons,
>>>>>
>>>>>> Hi Hassen,
>>>>>>
>>>>>> the HDFS plugin is read-only.
>>>>>
>>>>> Thanks for the clarification!
>>>>>
>>>>>> The idea is that you first copy the ROOT file onto HDFS and then
>>>>>> access it with ROOT.
>>>>>
>>>>> So it is possible to read a ROOT file stored on HDFS directly (not
>>>>> through FUSE). Can you please tell me the syntax? Here is, for
>>>>> example, a URI of a ROOT file in HDFS:
>>>>> hdfs://adminNode/user/hassen/file.root
>>>>
>>>> Sorry for the spam! Seeing
>>>> http://root.cern.ch/root/v526/Version526.news.html , it seems that this
>>>> syntax hdfs://adminNode/user/hassen/file.root did not work for us since
>>>> we are using an old version of ROOT.
>>>> By the way, if I understand correctly, the HDFS plugin of ROOT just allows
>>>> bypassing FUSE when reading from HDFS and does not allow reading a ROOT
>>>> file as an input stream. Is that right? If that is the case, is there an
>>>> alternative to TMemFile for reading a ROOT file as an input stream? I
>>>> expect that the usage of TMemFile in production will require an
>>>> unreasonable amount of RAM.
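>>>>
>>>> (For reference, the TMemFile approach we have in mind looks roughly like
>>>> the sketch below; it has to slurp the whole byte stream into a buffer
>>>> first, which is exactly why I worry about the RAM usage.)
>>>>
>>>>    // sketch: read a ROOT file from stdin into memory and open it as a TMemFile
>>>>    #include <iostream>
>>>>    #include <iterator>
>>>>    #include <vector>
>>>>    #include "TMemFile.h"
>>>>
>>>>    int main()
>>>>    {
>>>>       // slurp the whole stream: the full file image ends up in RAM
>>>>       std::vector<char> buf((std::istreambuf_iterator<char>(std::cin)),
>>>>                              std::istreambuf_iterator<char>());
>>>>
>>>>       TMemFile f("stdin.root", &buf[0], buf.size());   // wrap the buffer as a file
>>>>       if (f.IsZombie()) return 1;
>>>>       f.ls();                                          // placeholder for real analysis
>>>>       return 0;
>>>>    }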
>>>>
>>>> Am I missing something?
>>>>
>>>> Thanks for your help!
>>>> Hassen
>>>>
>>>>>
>>>>> cheers
>>>>> Hassen
>>>>>
>>>>>>
>>>>>> Cheers, Fons.
>>>>>>
>>>>>> On 18/04/2012 09:44, Hassen Riahi wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Another alternative is to try using the HDFS i/o plugin.
>>>>>>>
>>>>>>> Is there an HDFS I/O plugin in ROOT with which it is possible to write
>>>>>>> directly (not through FUSE) from ROOT to HDFS?
>>>>>>> If that is the case, please point us to the documentation.
>>>>>>>
>>>>>>> cheers
>>>>>>> Hassen
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Philippe.
>>>>>>>>
>>>>>>>> On 4/17/12 10:29 AM, Massimiliano Fasi wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> In order to use Apache Hadoop with MapReduce streaming, we need a C++
>>>>>>>>> way to copy or cast a ROOT file passed through the standard input into
>>>>>>>>> some type of ROOT object (a TFile, hopefully).
>>>>>>>>>
>>>>>>>>> Practically, we want to execute a command like
>>>>>>>>>
>>>>>>>>>> cat Myfile.root | MyAnalysisCode
>>>>>>>>>
>>>>>>>>> or
>>>>>>>>>
>>>>>>>>>> MyAnalysisCode < Myfile.root
>>>>>>>>>
>>>>>>>>> and then, in MyAnalysisCode, cast the standard input to something
>>>>>>>>> manageable by ROOT.
>>>>>>>>>
>>>>>>>>> The solutions we have tried so far didn't work. In particular, we tried
>>>>>>>>> to use ifstream, but we weren't able to cast its objects to
>>>>>>>>> something manageable by ROOT.
>>>>>>>>>
>>>>>>>>> Any hints or suggestions would be much appreciated.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Massimiliano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

-- 
Org:    CERN, European Laboratory for Particle Physics.
Mail:   1211 Geneve 23, Switzerland
E-Mail: Fons.Rademakers_at_cern.ch              Phone: +41 22 7679248
WWW:    http://fons.rademakers.org           Fax:   +41 22 7669640

