Working with data sets

NB: A new dataset management model, designed to solve the scalability issues experienced with large numbers of datasets, was introduced in 5.34/05, with ALICE as a concrete example implementation. The interface seen by the user is basically unchanged; however, some important additions have been made, and ALICE users are invited to read the dedicated documentation.


This section describes how the concept of a dataset is translated in ROOT and how to work with datasets in PROOF. When performing repeated analysis of large amounts of data, the PROOF dataset interface saves a lot of time during the validation phase by caching the information needed by PROOF.

A dataset is basically a named list of file information. The basic ROOT class to work with is TFileCollection. ROOT provides a dedicated manager class, TDataSetManager, to handle TFileCollection objects, and PROOF provides a full interface to the functionality offered by TDataSetManager.

To be used in PROOF, a dataset first needs to be registered; a registered dataset can be verified and its information retrieved. The list of existing datasets can be browsed, and more detailed information about the file distribution can also be retrieved. Finally, a dataset can be removed.

Registered datasets can be referred to by name in TProof::Process; multiple datasets can be processed at once.

Content of this page:

  1. Naming conventions
  2. Dataset handling interface in TProof
    1. Browsing the existing datasets
    2. Registering a dataset
    3. Verifying a dataset
    4. Showing detailed information about a dataset
    5. Retrieving a copy of a dataset
    6. Removing a dataset
  3. Processing datasets by name
  4. Processing many datasets at once
    1. Accessing the information about the current element

Naming conventions

While, in principle, one can refer to a dataset by a simple string, in practice it is convenient to use the string both to uniquely identify the dataset and to convey more information about the way it will be used. A naming convention has therefore been developed. Dataset names have the following general form:

[[/group/]user/]dsname[#[subdir/]objname][?enl=entrylist]

The first part allows datasets to be classified at the user and/or group level.

The part after the '#' is only relevant when using the dataset information: 'subdir' is an optional directory inside the file and 'objname' is the name of the object to be used; 'entrylist' is either the name of an existing TEntryList object or the path to a file containing the TEntryList to be used.
A few examples:

"mydset" Analysis of the first TTree in the top directory of the dataset named 'mydset'
"mydset#T" Analysis of TTree 'T' in the top directory of the dataset named 'mydset'
"mydset#adir/T" Analysis of TTree 'T' in the 'adir' subdirectory of the dataset named 'mydset'
"mydset#adir/" Analysis of the first TTree in the 'adir' subdirectory of the dataset named 'mydset'
"mydset?enl=mylist" Analysis of the first TTree in the top directory of the dataset named 'mydset' filtered with the TEntryList 'mylist'
"mydset#adir/?enl=mylist.root" Analysis of the first TTree in the 'adir' subdirectory of the dataset named 'mydset' filtered with the TEntryList from file 'mylist.root'

Dataset handling interface in TProof

The basic API functionality is described below. For the examples, a local PROOF cluster has been used with the files of the H1 example available from the ROOT HTTP server. The TFileCollection objects used in the examples can be generated with the macro getCollection.C.

Browsing the existing datasets

The first thing to do when working with datasets is to browse the existing information: the method TProof::ShowDataSets does that. If we have never used datasets before, the list will be empty and we get something like this:

 $ root -l
root [0] TProof *p = TProof::Open("localhost")
...
root [1] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged
root [2]

The directory used for the repository (default: 'datasets' in the master sandbox) is shown together with the header of the listing. See the examples below for less empty output.
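The list of registered datasets can also be retrieved programmatically with TProof::GetDataSets, which returns a TMap pairing dataset URIs with the corresponding TFileCollection objects. A minimal sketch, assuming an open PROOF session 'p' as in the transcripts above:

```cpp
// Sketch: iterate over the registered datasets from the client.
// Assumes an open PROOF session 'p'.
TMap *dsmap = p->GetDataSets();
if (dsmap) {
   TIter next(dsmap);
   TObject *key = 0;
   while ((key = next())) {
      TFileCollection *fc =
         dynamic_cast<TFileCollection *>(dsmap->GetValue(key));
      if (fc)
         Printf("%s: %lld files", key->GetName(), fc->GetNFiles());
   }
}
```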

Registering a dataset

The method TProof::RegisterDataSet is provided to register a dataset on the PROOF master. The method arguments are the name of the dataset, a pointer to the TFileCollection object describing the dataset, and an option field. The option field is a string which can contain (a meaningful combination of) the following characters:

O Overwrite an existing dataset with the same name, if any
U Update existing dataset with the information in the TFileCollection object; the new files are added at the end and duplicates are ignored
V Verify the dataset (in addition to registering)
T Trust the information contained in the dataset elements
S Verify the dataset serially (for ROOT >= 5.33/02)

Example: register the H1 files with name 'h1set'

root [1] .L getCollection.C+
root [2] TFileCollection *fch1 = getCollection("h1")
root [3] p->RegisterDataSet("h1set", fch1)
(Bool_t)1
root [4] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged
/default/ganis/h1set                      |       4 |        N/A              |   190 MB |    0 %

The dataset has been registered under the default group 'default'; the number of files is correct, but the remaining information is missing or estimated. It will be filled in after verification.
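As an alternative to the getCollection.C macro, a TFileCollection can also be built by hand from TFileInfo objects and then registered. A minimal sketch (the file URLs are those of the H1 example used throughout this page):

```cpp
// Sketch: build a TFileCollection manually and register it.
// Assumes an open PROOF session 'p'.
TFileCollection *fc = new TFileCollection("h1files", "H1 example files");
fc->Add(new TFileInfo("http://root.cern.ch/files/h1/dstarmb.root"));
fc->Add(new TFileInfo("http://root.cern.ch/files/h1/dstarp1a.root"));
fc->Add(new TFileInfo("http://root.cern.ch/files/h1/dstarp1b.root"));
fc->Add(new TFileInfo("http://root.cern.ch/files/h1/dstarp2.root"));
// 'O': overwrite a dataset with the same name; 'V': verify right away
p->RegisterDataSet("h1set", fc, "OV");
```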

Verifying a dataset

Verification is the process of checking the files of the dataset, finding out their exact location and their content (number of trees, entries of each tree, ...). For a verified dataset PROOF skips the validation step, which represents a significant gain when the number of files is large. For ROOT >= 5.33/02, verification is run in parallel by the workers with a dedicated TSelector (TSelVerifyDataSet). In older versions, verification was run serially on the master.

The method TProof::VerifyDataSet is provided to verify a dataset. The only meaningful argument is the dataset name; the string option field is ignored up to ROOT 5.32. In newer versions it can contain the character 'S' to force serial verification.

Example: verify the dataset 'h1set'

root [6] p->VerifyDataSet("h1set")
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: opening 4 files that appear to be newly staged
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: processing 0.'new' file: http://root.cern.ch/files/h1/dstarmb.root
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: processing 1.'new' file: http://root.cern.ch/files/h1/dstarp1a.root
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: processing 2.'new' file: http://root.cern.ch/files/h1/dstarp1b.root
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: processing 3.'new' file: http://root.cern.ch/files/h1/dstarp2.root
Mst-0: 13:34:59 25816 Mst-0 | Info in <:scandataset>: 4 files 'new'; 0 files touched; 0 files disappeared
(Int_t)0
root [7] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged
/default/ganis/h1set                      |       4 | /h42         |   283813 |   264 MB |  100 %

Now the name of the TTree, the number of entries and the size of the files are correct.

Showing detailed information about a dataset

The details about a dataset can be viewed with the method TProof::ShowDataSet. The arguments are the dataset name and a string option field. The option field is a string which can contain a combination of the following characters:

M Also display the metadata entries (default)
F Show details about all the files in the collection

Example: show all details about 'h1set'

root [11] p->ShowDataSet("h1set","MF")                                                                                                 
TFileCollection h1set -  contains: 4 files with a size of 277298426 bytes, 100.0 % staged - default tree name: '/h42'                  
The files contain the following trees:                                                                                                 
Tree /h42: 283813 events                                                                                                               
The collection contains the following files:                                                                                           
Collection name='THashList', class='THashList', size=4                                                                                 
 UUID: 604b445a-0ff7-11df-9717-0101007fbeef                                                                                            
MD5:  d41d8cd98f00b204e9800998ecf8427e                                                                                                 
Size: 21330730                                                                                                                         
 === URLs ===                                                                                                                          
 URL:  http://root.cern.ch/files/h1/dstarmb.root                                                                                       
 === Meta Data Object ===                                                                                                              
 Name:    /h42                                                                                                                         
 Class:   TTree
 Entries: 21920
 First:   0
 Last:    -1
 UUID: 6064b52a-0ff7-11df-9717-0101007fbeef
MD5:  d41d8cd98f00b204e9800998ecf8427e
Size: 71464503
 === URLs ===
 URL:  http://root.cern.ch/files/h1/dstarp1a.root
 === Meta Data Object ===
 Name:    /h42
 Class:   TTree
 Entries: 73243
 First:   0
 Last:    -1
 UUID: 608005c8-0ff7-11df-9717-0101007fbeef
MD5:  d41d8cd98f00b204e9800998ecf8427e
Size: 83827959
 === URLs ===
 URL:  http://root.cern.ch/files/h1/dstarp1b.root
 === Meta Data Object ===
 Name:    /h42
 Class:   TTree
 Entries: 85597
 First:   0
 Last:    -1
 UUID: 6096e838-0ff7-11df-9717-0101007fbeef
MD5:  d41d8cd98f00b204e9800998ecf8427e
Size: 100675234
 === URLs ===
 URL:  http://root.cern.ch/files/h1/dstarp2.root
 === Meta Data Object ===
 Name:    /h42
 Class:   TTree
 Entries: 103053
 First:   0
 Last:    -1
root [12]
root [12] p->ShowDataSet("h1set","")
TFileCollection h1set -  contains: 4 files with a size of 277298426 bytes, 100.0 % staged - default tree name: '/h42'
Retrieving a copy of a dataset

The dataset information, i.e. the corresponding TFileCollection object, can be retrieved with the method TProof::GetDataSet. The arguments are the dataset name and a string option field. The option can be used to select the subset of files served by a given server or list of servers: specify the server(s) (comma-separated) in the option field.
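For example, a minimal sketch retrieving the collection and inspecting it on the client (an open PROOF session 'p' and the 'h1set' dataset registered above are assumed):

```cpp
// Sketch: retrieve the TFileCollection of a registered dataset
// and print summary information. Assumes an open PROOF session 'p'.
TFileCollection *fc = p->GetDataSet("h1set");
if (fc) {
   Printf("files: %lld, size: %lld bytes",
          fc->GetNFiles(), fc->GetTotalSize());
   fc->Print();   // summary; Print("F") also lists the individual files
}
```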

Removing a dataset

A dataset can be removed with the method TProof::RemoveDataSet. The option field is currently ignored.

Example: removing 'h1set'

root [13] p->RemoveDataSet("h1set")
(Int_t)0
root [14] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged

Processing datasets by name

Verified datasets can be referred to by name in TProof::Process. The following example shows how to run the H1 example on the dataset 'h1set' introduced above:

root [0] TProof *p = TProof::Open("localhost")
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: OK (4 workers)
Setting up worker servers: OK (4 workers)
PROOF set to parallel mode (4 workers)
root [1] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged
/default/ganis/h1set                      |       4 | /h42         |   283813 |   264 MB |  100 %
root [2] p->Process("h1set", "tutorials/tree/h1analysis.C+")
Info in <:begin>: starting h1analysis with process option:
Looking up for exact location of files: OK (4 files)
Looking up for exact location of files: OK (4 files)
Validating files: OK (4 files)
Mst-0: merging output objects ... done
Mst-0: grand total: sent 4 objects, size: 5491 bytes
 FCN=-23769.9 FROM MIGRAD    STATUS=CONVERGED     214 CALLS         215 TOTAL
                     EDM=5.32355e-08    STRATEGY= 1  ERROR MATRIX UNCERTAINTY   1.7 per cent
  EXT PARAMETER                                   STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  p0           9.60009e+05   9.09409e+04   0.00000e+00  -1.03870e-08
   2  p1           3.51137e-01   2.33454e-02   0.00000e+00   2.83204e-02
   3  p2           1.18504e+03   5.74357e+01   0.00000e+00   2.75574e-06
   4  p3           1.45569e-01   5.50738e-05   0.00000e+00  -5.42172e-01
   5  p4           1.24391e-03   6.38933e-05   0.00000e+00  -1.56632e+00
                               ERR DEF= 0.5
(Long64_t)0

To process datasets which are registered but not verified, one needs to force validation by setting the parameter PROOF_LookupOpt to 'all':

root [] p->SetParameter("PROOF_LookupOpt", "all") 
This is because, by default, PROOF assumes that the information in the TFileCollection object is valid, and an unverified dataset has all files marked as unstaged, so that no valid files are found in the collection.

Processing many datasets at once

Since ROOT development version 5.27/02 it is possible to process more than one dataset in one go. There are two options: treat all the datasets as a single grand dataset, or process them sequentially, giving the user the possibility to keep the results separated. It is also possible to specify a text file from which the names of the datasets to be processed are read: the dataset names are given on one or more lines, and the lines found are joined as in the grand-dataset case, unless the file path is followed by a ',' (e.g. p->Process("datasets.txt,",...)), in which case they are treated as in the keep-separated case. The file is opened in raw mode with TFile::Open(...) and can therefore be remote, e.g. on a web server.
The following summarizes the options and syntax:

"dset1|dset2|..."  Grand dataset: one set of results.

"dset1 dset2 ..." or "dset1,dset2,..."  Keep datasets separated: the user can check dedicated bits in the currently processed element in the selector to find out when processing of a new dataset starts.

"datasets.txt"  Grand dataset, read from file: the datasets to be processed are read from the text file datasets.txt; the lines are joined as in the grand-dataset case.

"datasets.txt,"  Keep-separated, read from file: the datasets to be processed are read from the text file datasets.txt; the lines are treated as in the keep-separated case.
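As a sketch of the file-based variants, assuming a file datasets.txt reachable by the client and containing the two dataset names registered below, one per line:

```cpp
// datasets.txt contains:
//   h1seta
//   h1setb
// Assumes an open PROOF session 'p'.
p->Process("datasets.txt",  "tutorials/tree/h1analysis.C+"); // grand dataset
p->Process("datasets.txt,", "tutorials/tree/h1analysis.C+"); // keep separated
```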

To show this at work we use the getCollection.C macro to generate two datasets, 'h1seta' and 'h1setb', which together make the 'h1set' dataset.

root [2] TFileCollection *fch1a = getCollection("h1",1,2)
root [3] TFileCollection *fch1b = getCollection("h1",3,2)
root [4] p->RegisterDataSet("h1seta", fch1a, "V")
18:12:15 31667 Mst-0 | Info in <:scandataset>: opening 2 files that appear to be newly staged
18:12:15 31667 Mst-0 | Info in <:scandataset>: processing 0.'new' file: http://root.cern.ch/files/h1/dstarmb.root
18:12:15 31667 Mst-0 | Info in <:scandataset>: processing 1.'new' file: http://root.cern.ch/files/h1/dstarp1a.root
18:12:15 31667 Mst-0 | Info in <:scandataset>: 2 files 'new'; 0 files touched; 0 files disappeared
(Bool_t)1
root [5] p->RegisterDataSet("h1setb", fch1b, "V")
18:12:23 31667 Mst-0 | Info in <:scandataset>: opening 2 files that appear to be newly staged
18:12:23 31667 Mst-0 | Info in <:scandataset>: processing 0.'new' file: http://root.cern.ch/files/h1/dstarp1b.root
18:12:23 31667 Mst-0 | Info in <:scandataset>: processing 1.'new' file: http://root.cern.ch/files/h1/dstarp2.root
18:12:23 31667 Mst-0 | Info in <:scandataset>: 2 files 'new'; 0 files touched; 0 files disappeared
(Bool_t)1
root [6] p->ShowDataSets()
Dataset repository: /home/ganis/.proof/datasets
Dataset URI                               | # Files | Default tree | # Events |   Disk   | Staged
/default/ganis/h1set                      |       4 | /h42         |   283813 |   264 MB |  100 %
/default/ganis/h1seta                     |       2 | /h42         |    95163 |    88 MB |  100 %
/default/ganis/h1setb                     |       2 | /h42         |   188650 |   175 MB |  100 %

In the 'grand dataset' mode processing is fully equivalent to the case seen above:

root [7] p->Process("h1seta|h1setb", "tutorials/tree/h1analysis.C+")
Info in <:begin>: starting h1analysis with process option:
Looking up for exact location of files: OK (4 files)
Looking up for exact location of files: OK (4 files)
Validating files: OK (4 files)
Mst-0: merging output objects ... done
Mst-0: grand total: sent 4 objects, size: 5491 bytes
 FCN=-23769.9 FROM MIGRAD    STATUS=CONVERGED     214 CALLS         215 TOTAL
                     EDM=5.32355e-08    STRATEGY= 1  ERROR MATRIX UNCERTAINTY   1.7 per cent
  EXT PARAMETER                                   STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  p0           9.60009e+05   9.09409e+04   0.00000e+00  -1.03870e-08
   2  p1           3.51137e-01   2.33454e-02   0.00000e+00   2.83204e-02
   3  p2           1.18504e+03   5.74357e+01   0.00000e+00   2.75574e-06
   4  p3           1.45569e-01   5.50738e-05   0.00000e+00  -5.42172e-01
   5  p4           1.24391e-03   6.38933e-05   0.00000e+00  -1.56632e+00
                               ERR DEF= 0.5
(Long64_t)0

In the 'keep separated' mode the result is the same, but we see the notifications for the two runs: each dataset is processed separately with its own packetizer:

root [8] p->Process("h1seta h1setb", "tutorials/tree/h1analysis.C+")
Info in : unmodified script has already been compiled and loaded
Info in <:begin>: starting h1analysis with process option:
Looking up for exact location of files: OK (2 files)
Looking up for exact location of files: OK (2 files)
Validating files: OK (2 files)
Looking up for exact location of files: OK (2 files)
Looking up for exact location of files: OK (2 files)
Validating files: OK (2 files)
Mst-0: merging output objects ... done
Mst-0: grand total: sent 4 objects, size: 5491 bytes
Warning in <:constructor>: Deleting canvas with same name: c1
 FCN=-23769.9 FROM MIGRAD    STATUS=CONVERGED     214 CALLS         215 TOTAL
                     EDM=5.32355e-08    STRATEGY= 1  ERROR MATRIX UNCERTAINTY   1.7 per cent
  EXT PARAMETER                                   STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  p0           9.60009e+05   9.09409e+04   0.00000e+00  -1.03870e-08
   2  p1           3.51137e-01   2.33454e-02   0.00000e+00   2.83204e-02
   3  p2           1.18504e+03   5.74357e+01   0.00000e+00   2.75574e-06
   4  p3           1.45569e-01   5.50738e-05   0.00000e+00  -5.42172e-01
   5  p4           1.24391e-03   6.38933e-05   0.00000e+00  -1.56632e+00
                               ERR DEF= 0.5
(Long64_t)0

Entry lists can be applied to each dataset following the syntax described above. In particular, the same dataset can be processed several times in a row with different entry lists.
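For instance, a sketch combining the multi-dataset syntax with the '?enl=' suffix ('listA' and 'listB' are hypothetical entry-list names):

```cpp
// Grand dataset, each component filtered by its own (hypothetical) entry list.
// Assumes an open PROOF session 'p'.
p->Process("h1seta?enl=listA|h1setb?enl=listB", "tutorials/tree/h1analysis.C+");
// Same dataset processed twice in a row with different entry lists
p->Process("h1set?enl=listA h1set?enl=listB", "tutorials/tree/h1analysis.C+");
```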

Accessing the information about the current element

Since ROOT version 5.26/00 the packet currently being processed can be accessed in the selector via the input list. The packet is described by the class TDSetElement. In version 5.27/02 new information has been added to this class: the bits TDSetElement::kNewRun and TDSetElement::kNewPacket, which flag new runs (new datasets) and new packets, respectively; the name of the dataset currently being processed; and a list of files possibly associated with the file being processed (for future use). The TDSetElement object is the value of a TPair named 'PROOF_CurrentElement' available from the input list. The following is an example of how to use this information:

//_____________________________________________________________________
Bool_t mySelector::Process(Long64_t entry)
{
   // entry is the entry number in the current Tree

   // ...

   // Link to current element, if any
   TPair *elemPair = 0;
   if (fInput && (elemPair = dynamic_cast<TPair *>(fInput->FindObject("PROOF_CurrentElement")))) {
      TDSetElement *fCurrent = dynamic_cast<TDSetElement *>(elemPair->Value());
      if (fCurrent) {
         if (fCurrent->TestBit(TDSetElement::kNewRun)) {
            Info("Process", "entry %lld: starting new run for dataset '%s'",
                             entry, fCurrent->GetDataSet());
         }
         if (fCurrent->TestBit(TDSetElement::kNewPacket)) {
            Info("Process", "entry %lld: new packet from: %s, first: %lld, last: %lld",
                             entry, fCurrent->GetName(), fCurrent->GetFirst(),
                             fCurrent->GetFirst()+fCurrent->GetNum()-1);
         }
      }
   }

   // ...
   return kTRUE;
}