Re: Manage a ROOT file from a streaming of the ROOT file content

From: Hassen Riahi <hassen.riahi_at_pg.infn.it>
Date: Thu, 19 Apr 2012 12:42:13 +0200


Hi Charles and Philippe,

> To add to Philippe's point, in Hadoop MapReduce,

The answer to the related question asked by Philippe is that we are trying to use the MapReduce framework, which passes the input data to the map task via a pipe.
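
Just to make the pipe approach concrete, here is a minimal sketch of what consuming a ROOT file from stdin could look like, assuming a ROOT version recent enough to provide TMemFile (>= 5.30); the name "stdin.root" is just a label. Note that the whole file has to fit in RAM, which is exactly the concern discussed below:

// tmemfile_from_stdin.cxx - a minimal sketch, assuming ROOT >= 5.30 (TMemFile)
#include <iostream>
#include <vector>
#include <unistd.h>
#include "TMemFile.h"

int main()
{
   // Slurp the whole ROOT file from the pipe into memory.
   std::vector<char> buf;
   char chunk[65536];
   ssize_t n;
   while ((n = read(STDIN_FILENO, chunk, sizeof(chunk))) > 0)
      buf.insert(buf.end(), chunk, chunk + n);
   if (buf.empty()) { std::cerr << "empty input" << std::endl; return 1; }

   // Wrap the buffer in a TMemFile; "stdin.root" is only a label.
   TMemFile f("stdin.root", &buf[0], buf.size());
   if (f.IsZombie()) { std::cerr << "not a ROOT file" << std::endl; return 1; }
   f.ls(); // the real analysis would go here
   return 0;
}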

> you may pass the meta-information for the file to a streaming or
> pipes C++ job, then have it operate on the file directly.

OK, there are two ways to do this (minimal sketches of both follow below):

I- Give as input a txt file containing the paths of the ROOT files. But in this way we will not benefit from the Hadoop data locality optimization, since the jobs will execute where the txt files reside and not where the ROOT files do.
II- Give as input the ROOT files themselves. That is what we are trying to achieve (we are trying to make this solution work as optimally as possible).
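
For (I), the mapper would read one path per line from stdin and open each file directly; a minimal sketch, assuming a plain txt input of paths and the default TextInputFormat:

// mapper_paths.cxx - minimal sketch for option (I); assumes the job input
// is a text file with one ROOT file path or URI per line, which Hadoop
// streaming hands to the mapper line by line on stdin.
#include <iostream>
#include <string>
#include "TFile.h"

int main()
{
   std::string path;
   while (std::getline(std::cin, path)) {
      // "hdfs://..." URIs need the ROOT HDFS plugin; a FUSE mount path works too.
      TFile *f = TFile::Open(path.c_str());
      if (!f || f->IsZombie()) {
         std::cerr << "cannot open " << path << std::endl;
         delete f;
         continue;
      }
      // ... per-file analysis here; emit "key\tvalue" records on stdout ...
      f->Close();
      delete f;
   }
   return 0;
}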

> That is,
> 1. Assume that there is some Java or C++ MapReduce driver -- your
> map method -- that is mapping across a collection of root files.
> 2. You may obtain the path of an input file within the map task, and
> assuming that this is stored in HDFS,

It is possible with both (I) and (II).
In (I), it is easy to achieve, since the input stream of the txt file is exactly the list of paths of the ROOT files. The con, as said above, is that the job will not benefit from the Hadoop data locality optimization. In (II), the binary stream of the ROOT file will not be used by the map. The map (a C++ ROOT application) gets the path of the ROOT file some other way, then opens and reads the file itself. The con of this solution is that the ROOT files are opened and read twice: the first time by the MapReduce framework and the second time by the map.
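
For (II), one way for the map to get the path is through the environment: as far as I know, Hadoop streaming exports job parameters as environment variables with dots replaced by underscores, so map.input.file should show up as map_input_file (to be verified against the Hadoop version in use). A minimal sketch:

// mapper_direct.cxx - minimal sketch for option (II): ignore the bytes on
// stdin and open the file directly via its path.
#include <cstdlib>
#include <iostream>
#include <limits>
#include "TFile.h"

int main()
{
   // Drain the pipe so the framework's writer is not left blocked; this is
   // the redundant first read of the file mentioned above.
   std::cin.ignore(std::numeric_limits<std::streamsize>::max());

   // map.input.file exposed as map_input_file: an assumption to verify.
   const char *path = std::getenv("map_input_file");
   if (!path) { std::cerr << "map_input_file not set" << std::endl; return 1; }

   TFile *f = TFile::Open(path); // direct, random-access read of the ROOT file
   if (!f || f->IsZombie()) { std::cerr << "cannot open " << path << std::endl; return 1; }
   // ... analysis; emit results on stdout ...
   f->Close();
   delete f;
   return 0;
}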

> use the hdfs package to copy the file to a temporary work directory.
> The job.local.dir is, I believe, the default location for scratch.

So it requires transferring the input file to the worker node. I would, if possible, avoid this, since it can imply a lot of traffic inside the cluster.
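
For completeness, the copy itself would be straightforward with the libhdfs C API; a minimal sketch, where the namenode host/port and the paths are placeholders, and with the extra-traffic caveat above in mind:

// copy_to_scratch.cxx - minimal sketch with the libhdfs C API (hdfs.h).
#include <hdfs.h>
#include <cstdio>

int main()
{
   hdfsFS fs = hdfsConnect("adminNode", 8020); // namenode; port is a guess
   hdfsFS local = hdfsConnect(NULL, 0);        // NULL host = local filesystem
   if (!fs || !local) { std::fprintf(stderr, "hdfs connect failed\n"); return 1; }

   // Copy the input into the task's scratch area (e.g. job.local.dir).
   if (hdfsCopy(fs, "/user/hassen/file.root", local, "./file.root") != 0) {
      std::fprintf(stderr, "hdfsCopy failed\n");
      return 1;
   }

   hdfsDisconnect(local);
   hdfsDisconnect(fs);
   return 0;
}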

> 3. You then would pass this path to your C++ root application (the
> "worker") -- it could be pipes or streaming -- and the worker can
> then write back to HDFS itself, or pass back a path that the mapper
> could then use to write data back to HDFS or pass on to a reduce
> phase.
> If you would like, I can endeavor to clean up some very ugly and
> uncommented code on bitbucket and pass that link to you.

Yes, sure, that would be great! I will scale-test it and keep you informed about the results.

cheers
Hassen

> C
> On Apr 18, 2012, at 10:15 AM, Philippe Canal wrote:
>
>> Hi Hassen,
>>
>> > it seems that this syntax hdfs://adminNode/user/hassen/file.root
>> did not work for us since we are using an old version of ROOT.
>>
>> Yes, the HDFS plugin was introduced only in v5.26 (and may need to
>> be explicitly requested when running the configure command).
>>
>> > By the way, if I understand correctly, the HDFS plugin of ROOT
>> > just allows bypassing FUSE when reading from HDFS
>>
>> Yes.
>>
>> > and not reading a ROOT file as an input stream. Is that right?
>>
>> Yes, the current implementation of TFile and TTree *requires*
>> random access to the file and thus cannot read directly from an
>> input stream.
>>
>> > Since I expect that the usage of TMemFile in production will
>> require an unreasonable amount of RAM.
>>
>> A related question, though: why do you really need to pass the
>> data via the pipe? In a similar environment (PROOF), rather than
>> passing the data around from the controller to the controllee, we
>> pass meta-information (filename, treename, entry range), and the
>> controllee/worker then accesses the file directly and efficiently.
>>
>> Cheers,
>> Philippe.
>>
>> On 4/18/12 6:21 AM, Hassen Riahi wrote:
>>>
>>>
>>>> Hi Fons,
>>>>
>>>>> Hi Hassen,
>>>>>
>>>>> the HDFS plugin is readonly.
>>>>
>>>> Thanks for the clarification!
>>>>
>>>>> The idea is that you first copy the ROOT file onto HDFS and then
>>>>> access it with ROOT.
>>>>
>>>> So it is possible to read a ROOT file stored on HDFS directly
>>>> (not through FUSE). Could you please tell me the syntax? Here is,
>>>> for example, a URI of a ROOT file in HDFS:
>>>> hdfs://adminNode/user/hassen/file.root
>>>
>>> Sorry for the spam! Looking at
>>> http://root.cern.ch/root/v526/Version526.news.html, it seems that
>>> the syntax hdfs://adminNode/user/hassen/file.root did not work for
>>> us because we are using an old version of ROOT.
>>>
>>> By the way, if I understand correctly, the HDFS plugin of ROOT
>>> just allows bypassing FUSE when reading from HDFS, and not reading
>>> a ROOT file as an input stream. Is that right? If that is the
>>> case, is there an alternative to TMemFile for reading a ROOT file
>>> as an input stream? I ask because I expect that the usage of
>>> TMemFile in production will require an unreasonable amount of RAM.
>>>
>>> Am I missing something?
>>>
>>> Thanks for your help!
>>> Hassen
>>>
>>>>
>>>> cheers
>>>> Hassen
>>>>
>>>>>
>>>>> Cheers, Fons.
>>>>>
>>>>> On 18/04/2012 09:44, Hassen Riahi wrote:
>>>>>> Hi,
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Another alternative is to try using the HDFS i/o plugin.
>>>>>>
>>>>>> Is there an HDFS I/O plugin in ROOT with which it is possible
>>>>>> to write directly (not through FUSE) from ROOT to HDFS?
>>>>>> If so, please point us to the documentation.
>>>>>>
>>>>>> cheers
>>>>>> Hassen
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Philippe.
>>>>>>>
>>>>>>> On 4/17/12 10:29 AM, Massimiliano Fasi wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> In order to use Apache Hadoop with MapReduce streaming, we
>>>>>>>> need a C++ way to copy or cast a ROOT file passed through
>>>>>>>> standard input to some type of ROOT object (hopefully a
>>>>>>>> TFile).
>>>>>>>>
>>>>>>>> Practically, we want to execute a command like
>>>>>>>>
>>>>>>>>> cat Myfile.root | MyAnalysisCode
>>>>>>>>
>>>>>>>> or
>>>>>>>>
>>>>>>>>> MyAnalysisCode < Myfile.root
>>>>>>>>
>>>>>>>> and then, inside MyAnalysisCode, cast the standard input to
>>>>>>>> something manageable by ROOT.
>>>>>>>>
>>>>>>>> The solutions we have tried so far didn't work. In
>>>>>>>> particular, we tried to use std::ifstream, but we weren't able
>>>>>>>> to cast its objects to something manageable by ROOT.
>>>>>>>>
>>>>>>>> Any hints or suggestions would be much appreciated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Massimiliano
>>>>>>>>
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> This message was sent using IMP, the INFN Perugia Internet
>>>>>>>> Messaging Program.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Org: CERN, European Laboratory for Particle Physics.
>>>>> Mail: 1211 Geneve 23, Switzerland
>>>>> E-Mail: Fons.Rademakers_at_cern.ch Phone: +41 22 7679248
>>>>> WWW: http://fons.rademakers.org Fax: +41 22 7669640
>>>>>
>>>>
>>>
>
Received on Thu Apr 19 2012 - 12:42:40 CEST
