Logo ROOT   6.12/07
Reference Guide
TDataFrame.cxx
Go to the documentation of this file.
1 // Author: Enrico Guiraud, Danilo Piparo CERN 12/2016
2 
3 /*************************************************************************
4  * Copyright (C) 1995-2016, Rene Brun and Fons Rademakers. *
5  * All rights reserved. *
6  * *
7  * For the licensing terms see $ROOTSYS/LICENSE. *
8  * For the list of contributors see $ROOTSYS/README/CREDITS. *
9  *************************************************************************/
10 
11 #include "ROOT/TDataFrame.hxx"
12 using namespace ROOT::Experimental;
13 
14 // clang-format off
15 /**
16 * \class ROOT::Experimental::TDataFrame
17 * \ingroup dataframe
18 * \brief ROOT's TDataFrame offers a high level interface for analyses of data stored in `TTree`s.
19 
20 In addition, multi-threading and other low-level optimisations allow users to exploit all the resources available
21 on their machines completely transparently.<br>
22 Skip to the [class reference](#reference) or keep reading for the user guide.
23 
24 In a nutshell:
25 ~~~{.cpp}
26 ROOT::EnableImplicitMT(); // Tell ROOT you want to go parallel
27 ROOT::Experimental::TDataFrame d("myTree", "file.root"); // Interface to TTree and TChain
28 auto myHisto = d.Histo1D("Branch_A"); // This happens in parallel!
29 myHisto->Draw();
30 ~~~
31 
32 Calculations are expressed in terms of a type-safe *functional chain of actions and transformations*, `TDataFrame` takes
33 care of their execution. The implementation automatically puts in place several low level optimisations such as
34 multi-thread parallelisation and caching.
35 The namespace containing the TDataFrame is ROOT::Experimental. This signals the fact that the interfaces may evolve in
36 time.
37 
38 \htmlonly
39 <a href="https://doi.org/10.5281/zenodo.260230"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.260230.svg"
40 alt="DOI"></a>
41 \endhtmlonly
42 
43 ## Table of Contents
44 - [Introduction](#introduction)
45 - [Crash course](#crash-course)
46 - [More features](#more-features)
47 - [Transformations](#transformations) -- manipulating data
48 - [Actions](#actions) -- getting results
49 - [Parallel execution](#parallel-execution) -- how to use it and common pitfalls
50 - [Class reference](#reference) -- most methods are implemented in the TInterface base class
51 
52 ## <a name="introduction"></a>Introduction
53 Users define their analysis as a sequence of operations to be performed on the data-frame object; the framework
54 takes care of the management of the loop over entries as well as low-level details such as I/O and parallelisation.
55 `TDataFrame` provides methods to perform most common operations required by ROOT analyses;
56 at the same time, users can just as easily specify custom code that will be executed in the event loop.
57 
58 `TDataFrame` is built with a *modular* and *flexible* workflow in mind, summarised as follows:
59 
60 1. **build a data-frame** object by specifying your data-set
61 2. **apply a series of transformations** to your data
62  1. **filter** (e.g. apply some cuts) or
63  2. **define** a new column (e.g. the result of an expensive computation on branches)
64 3. **apply actions** to the transformed data to produce results (e.g. fill a histogram)
65 
66 The following table shows how analyses based on `TTreeReader` and `TTree::Draw` translate to `TDataFrame`. Follow the
67 [crash course](#crash-course) to discover more idiomatic and flexible ways to express analyses with `TDataFrame`.
68 <table>
69 <tr>
70  <td>
71  <b>TTreeReader</b>
72  </td>
73  <td>
74  <b>ROOT::Experimental::TDataFrame</b>
75  </td>
76 </tr>
77 <tr>
78  <td>
79 ~~~{.cpp}
80 TTreeReader reader("myTree", file);
81 TTreeReaderValue<A_t> a(reader, "A");
82 TTreeReaderValue<B_t> b(reader, "B");
83 TTreeReaderValue<C_t> c(reader, "C");
84 while(reader.Next()) {
85  if(IsGoodEvent(a, b, c))
86  DoStuff(a, b, c);
87 }
88 ~~~
89  </td>
90  <td>
91 ~~~{.cpp}
92 ROOT::Experimental::TDataFrame d("myTree", file, {"A", "B", "C"});
93 d.Filter(IsGoodEvent).Foreach(DoStuff);
94 ~~~
95  </td>
96 </tr>
97 <tr>
98  <td>
99  <b>TTree::Draw</b>
100  </td>
101  <td>
102  <b>ROOT::Experimental::TDataFrame</b>
103  </td>
104 </tr>
105 <tr>
106  <td>
107 ~~~{.cpp}
108 TTree *t = static_cast<TTree*>(
109  file->Get("myTree")
110 );
111 t->Draw("var", "var > 2");
112 ~~~
113  </td>
114  <td>
115 ~~~{.cpp}
116 ROOT::Experimental::TDataFrame d("myTree", file);
117 auto h = d.Filter("var > 2").Histo1D("var");
118 ~~~
119  </td>
120 </tr>
121 </table>
122 
123 ## <a name="crash-course"></a> Crash course
124 All snippets of code presented in the crash course can be executed in the ROOT interpreter. Simply precede them with
125 ~~~{.cpp}
126 using namespace ROOT::Experimental; // TDataFrame's namespace
127 ~~~
128 which is omitted for brevity. The terms "column" and "branch" are used interchangeably.
129 
130 ### Creating a TDataFrame
131 TDataFrame's constructor is where the user specifies the dataset and, optionally, a default set of columns that
132 operations should work with. Here are the most common methods to construct a TDataFrame object:
133 ~~~{.cpp}
134 // single file -- all ctors are equivalent
135 TDataFrame d1("treeName", "file.root");
136 TFile *f = TFile::Open("file.root");
137 TDataFrame d2("treeName", f); // same as TTreeReader
138 TTree *t = nullptr;
139 f.GetObject("treeName", t);
140 TDataFrame d3(*t); // TTreeReader takes a pointer, TDF takes a reference
141 
142 // multiple files -- all ctors are equivalent
143 TDataFrame d3("myTree", {"file1.root", "file2.root"});
144 std::vector<std::string> files = {"file1.root", "file2.root"};
145 TDataFrame d3("myTree", files);
146 TDataFrame d4("myTree", "file*.root"); // see TRegexp's documentation for a list of valid regexes
147 TChain chain("myTree");
148 chain.Add("file1.root");
149 chain.Add("file2.root");
150 TDataFrame d3(chain);
151 ~~~
152 Additionally, users can construct a TDataFrame specifying just an integer number. This is the number of "events" that
153 will be generated by this TDataFrame.
154 ~~~{.cpp}
155 TDataFrame d(10); // a TDF with 10 entries (and no columns/branches, for now)
156 d.Foreach([] { static int i = 0; std::cout << i++ << std::endl; }); // silly example usage: count to ten
157 ~~~
158 This is useful to generate simple data-sets on the fly: the contents of each event can be specified via the `Define`
159 transformation (explained below). For example, we have used this method to generate Pythia events (with a `Define`
160 transformation) and write them to disk in parallel (with the `Snapshot` action).
161 
162 ### Programmatically get the list of column names
163 The list of column names available in the dataset can be obtained with the `GetColumnsNames` method:
164 ~~~{.cpp}
165 TDataFrame d("myTree", "file.root");
166 auto colNames = d.GetColumnNames();
167 for (auto &&colName : colNames) {
168  std::cout << colName << std::endl;
169  }
170 ~~~
171 
172 ### Filling a histogram
173 Let's now tackle a very common task, filling a histogram:
174 ~~~{.cpp}
175 // Fill a TH1D with the "MET" branch
176 TDataFrame d("myTree", "file.root");
177 auto h = d.Histo1D("MET");
178 h->Draw();
179 ~~~
180 The first line creates a `TDataFrame` associated to the `TTree` "myTree". This tree has a branch named "MET".
181 
182 `Histo1D` is an *action*; it returns a smart pointer (a `TResultProxy` to be precise) to a `TH1D` histogram filled
183 with the `MET` of all events. If the quantity stored in the branch is a collection (e.g. a vector or an array), the
184 histogram is filled with its elements.
185 
186 You can use the objects returned by actions as if they were pointers to the desired results. There are many other
187 possible [actions](#overview), and all their results are wrapped in smart pointers; we'll see why in a minute.
188 
189 ### Applying a filter
190 Let's say we want to cut over the value of branch "MET" and count how many events pass this cut. This is one way to do it:
191 ~~~{.cpp}
192 TDataFrame d("myTree", "file.root");
193 auto c = d.Filter("MET > 4.").Count();
194 std::cout << *c << std::endl;
195 ~~~
196 The filter string (which must contain a valid c++ expression) is applied to the specified branches for each event;
197 the name and types of the columns are inferred automatically. The string expression is required to return a `bool`
198 which signals whether the event passes the filter (`true`) or not (`false`).
199 
200 You can think of your data as "flowing" through the chain of calls, being transformed, filtered and finally used to
201 perform actions. Multiple `Filter` calls can be chained one after another.
202 
203 Using string filters is nice for simple things, but they are limited to specifying the equivalent of a single return
204 statement or the body of a lambda, so it's cumbersome to use strings with more complex filters. They also add a small
205 runtime overhead, as ROOT needs to just-in-time compile the string into C++ code. When more freedom is required or
206 runtime performance is very important, a C++ callable can be specified instead (a lambda in the following snippet,
207 but it can be any kind of function or even a functor class), together with a list of branch names.
208 This snippet is analogous to the one above:
209 ~~~{.cpp}
210 TDataFrame d("myTree", "file.root");
211 auto metCut = [](double x) { return x > 4.; }; // a c++11 lambda function checking "x > 4"
212 auto c = d.Filter(metCut, {"MET"}).Count();
213 std::cout << *c << std::endl;
214 ~~~
215 
216 An example of a more complex filter expressed as a string containing C++ code is shown below
217 
218 ~~~{.cpp}
219 TDataFrame d("myTree", "file.root");
220 auto df = d.Define("p", "std::array<double, 4> p{px, py, pz, E}; return p;")
221  .Filter("double p2 = 0.0; for (auto&& x : p) p2 += x*x; return sqrt(p2) < 10.0;");
222 ~~~
223 
224 The code snippet above defines a column `p` that is a fixed-size array using the component column names and then
225 filters on its magnitude by looping over its elements. It must be noted that the usage of strings to define columns
226 like the one above is a major advantage when using PyROOT. However, only constants and data coming from other columns
227 in the dataset can be involved in the code passed as a string. Local variables and functions cannot be used, since
228 the interpreter will not know how to find them. When capturing local state is necessary, a C++ callable can be used.
229 
230 More information on filters and how to use them to automatically generate cutflow reports can be found [below](#Filters).
231 
232 ### Defining custom columns
233 Let's now consider the case in which "myTree" contains two quantities "x" and "y", but our analysis relies on a derived
234 quantity `z = sqrt(x*x + y*y)`. Using the `Define` transformation, we can create a new column in the data-set containing
235 the variable "z":
236 ~~~{.cpp}
237 TDataFrame d("myTree", "file.root");
238 auto sqrtSum = [](double x, double y) { return sqrt(x*x + y*y); };
239 auto zMean = d.Define("z", sqrtSum, {"x","y"}).Mean("z");
240 std::cout << *zMean << std::endl;
241 ~~~
242 `Define` creates the variable "z" by applying `sqrtSum` to "x" and "y". Later in the chain of calls we refer to
243 variables created with `Define` as if they were actual tree branches/columns, but they are evaluated on demand, at most
244 once per event. As with filters, `Define` calls can be chained with other transformations to create multiple custom
245 columns. `Define` and `Filter` transformations can be concatenated and intermixed at will.
246 
247 As with filters, it is possible to specify new columns as string expressions. This snippet is analogous to the one above:
248 ~~~{.cpp}
249 TDataFrame d("myTree", "file.root");
250 auto zMean = d.Define("z", "sqrt(x*x + y*y)").Mean("z");
251 std::cout << *zMean << std::endl;
252 ~~~
253 Again the names of the branches used in the expression and their types are inferred automatically. The string must be
254 valid c++ and is just-in-time compiled by the ROOT interpreter, cling -- the process has a small runtime overhead.
255 
256 Previously, when showing the different ways a TDataFrame can be created, we showed a constructor that only takes a
257 number of entries a parameter. In the following example we show how to combine such an "empty" `TDataFrame` with `Define`
258 transformations to create a data-set on the fly. We then save the generated data on disk using the `Snapshot` action.
259 ~~~{.cpp}
260 TDataFrame d(100); // a TDF that will generate 100 entries (currently empty)
261 int x = -1;
262 auto d_with_columns = d.Define("x", [&x] { return ++x; })
263  .Define("xx", [&x] { return x*x; });
264 d_with_columns.Snapshot("myNewTree", "newfile.root");
265 ~~~
266 This example is slightly more advanced than what we have seen so far: for starters, it makes use of lambda captures (a
267 simple way to make external variables available inside the body of c++ lambdas) to act on the same variable `x` from
268 both `Define` transformations. Secondly we have *stored* the transformed data-frame in a variable. This is always
269 possible: at each point of the transformation chain, users can store the status of the data-frame for further use (more
270 on this [below](#callgraphs)).
271 
272 You can read more about defining new columns [here](#custom-columns).
273 
274 \image html TDF_Graph.png "A graph composed of two branches, one starting with a filter and one with a define. The end point of a branch is always an action."
275 
276 ### Running on a range of entries
277 It is sometimes necessary to limit the processing of the dataset to a range of entries. For this reason, the TDataFrame
278 offers the concept of ranges as a node of the TDataFrame chain of transformations; this means that filters, columns and
279 actions can be concatenated to and intermixed with `Range`s. If a range is specified after a filter, the range will act
280 exclusively on the entries passing the filter -- it will not even count the other entries! The same goes for a `Range`
281 hanging from another `Range`. Here are some commented examples:
282 ~~~{.cpp}
283 TDataFrame d("myTree", "file.root");
284 // Here we store a data-frame that loops over only the first 30 entries in a variable
285 auto d30 = d.Range(30);
286 // This is how you pick all entries from 15 onwards
287 auto d15on = d.Range(15, 0);
288 // We can specify a stride too, in this case we pick an event every 3
289 auto d15each3 = d.Range(0, 15, 3);
290 ~~~
291 Note that ranges are not available when multi-threading is enabled. More information on ranges is available
292 [here](#ranges).
293 
294 ### Executing multiple actions in the same event loop
295 As a final example let us apply two different cuts on branch "MET" and fill two different histograms with the "pt\_v" of
296 the filtered events.
297 By now, you should be able to easily understand what's happening:
298 ~~~{.cpp}
299 TDataFrame d("treeName", "file.root");
300 auto h1 = d.Filter("MET > 10").Histo1D("pt_v");
301 auto h2 = d.Histo1D("pt_v");
302 h1->Draw(); // event loop is run once here
303 h2->Draw("SAME"); // no need to run the event loop again
304 ~~~
305 `TDataFrame` executes all above actions by **running the event-loop only once**. The trick is that actions are not
306 executed at the moment they are called, but they are **lazy**, i.e. delayed until the moment one of their results is
307 accessed through the smart pointer. At that time, the event loop is triggered and *all* results are produced
308 simultaneously.
309 
310 It is therefore good practice to declare all your transformations and actions *before* accessing their results, allowing
311 `TDataFrame` to run the loop once and produce all results in one go.
312 
313 ### Going parallel
314 Let's say we would like to run the previous examples in parallel on several cores, dividing events fairly between cores.
315 The only modification required to the snippets would be the addition of this line *before* constructing the main
316 data-frame object:
317 ~~~{.cpp}
318 ROOT::EnableImplicitMT();
319 ~~~
320 Simple as that. More details are given [below](#parallel-execution).
321 
322 ## <a name="more-features"></a>More features
323 Here is a list of the most important features that have been omitted in the "Crash course" for brevity.
324 You don't need to read all these to start using `TDataFrame`, but they are useful to save typing time and runtime.
325 
326 ### Treatment of columns holding collections
327 When using TDataFrame to read data from a ROOT file, users can specify that the type of a branch is `TArrayBranch<T>` to indicate the branch is a c-style array, an STL array or any other collection type associated to a contiguous storage in memory.
328 
329 Column values of type `TArrayBranch<T>` perform no copy of the underlying array data, it's in some sense a view, and offer a minimal array-like interface to access the array elements: either via square brackets, or with range-based for loops.
330 
331 The `TArrayBranch<T>` type signals to TDataFrame that a special behaviour needs to be adopted when snapshotting a dataset on disk. Indeed, if columns which are variable size C arrays are treated via the `TArrayBranch<T>`, TDataFrame will correctly persistify them - if anything else is adopted, for example `std::span`, only the first element of the array will be written.
332 
333 ### Callbacks
334 Acting on a TResultProxy, it is possible to register a callback that TDataFrame will execute "everyNEvents" on a partial result.
335 
336 The callback must be a callable that takes a reference to the result type as argument and returns nothing.
337 TDataFrame, acting as a full fledged data processing framework, will invoke registered callbacks passing partial action results as arguments to them (e.g. a histogram filled with a part of the selected events).
338 
339 Callbacks can be used e.g. to inspect partial results of the analysis while the event loop is running. For
340 example one can draw an up-to-date version of a result histogram every 100 entries like this:
341 ~~~{.cpp}
342 auto h = tdf.Histo1D("x");
343 TCanvas c("c","x hist");
344 h.OnPartialResult(100, [&c](TH1D &h_) { c.cd(); h_.Draw(); c.Update(); });
345 h->Draw(); // event loop runs here, this `Draw` is executed after the event loop is finished
346 ~~~
347 
348 ### Default branch lists
349 When constructing a `TDataFrame` object, it is possible to specify a **default column list** for your analysis, in the
350 usual form of a list of strings representing branch/column names. The default column list will be used as a fallback
351 whenever a list specific to the transformation/action is not present. TDataFrame will take as many of these columns as
352 needed, ignoring trailing extra names if present.
353 ~~~{.cpp}
354 // use "b1" and "b2" as default branches
355 TDataFrame d1("myTree", "file.root", {"b1","b2"});
356 auto h = d1.Filter([](int b1, int b2) { return b1 > b2; }) // will act on "b1" and "b2"
357  .Histo1D(); // will act on "b1"
358 
359 // just one default branch this time
360 TDataFrame d2("myTree", "file.root", {"b1"});
361 auto min = d2.Filter([](double b2) { return b2 > 0; }, {"b2"}) // we can still specify non-default branch lists
362  .Min(); // returns the minimum value of "b1" for the filtered entries
363 ~~~
364 
365 ### Branch type guessing and explicit declaration of branch types
366 C++ is a statically typed language: all types must be known at compile-time. This includes the types of the `TTree`
367 branches we want to work on. For filters, temporary columns and some of the actions, **branch types are deduced from the
368 signature** of the relevant filter function/temporary column expression/action function:
369 ~~~{.cpp}
370 // here b1 is deduced to be `int` and b2 to be `double`
371 dataFrame.Filter([](int x, double y) { return x > 0 && y < 0.; }, {"b1", "b2"});
372 ~~~
373 If we specify an incorrect type for one of the branches, an exception with an informative message will be thrown at
374 runtime, when the branch value is actually read from the `TTree`: `TDataFrame` detects type mismatches. The same would
375 happen if we swapped the order of "b1" and "b2" in the branch list passed to `Filter`.
376 
377 Certain actions, on the other hand, do not take a function as argument (e.g. `Histo1D`), so we cannot deduce the type of
378 the branch at compile-time. In this case **`TDataFrame` infers the type of the branch** from the `TTree` itself. This
379 is why we never needed to specify the branch types for all actions in the above snippets.
380 
381 When the branch type is not a common one such as `int`, `double`, `char` or `float` it is nonetheless good practice to
382 specify it as a template parameter to the action itself, like this:
383 ~~~{.cpp}
384 dataFrame.Histo1D("b1"); // OK, the type of "b1" is deduced at runtime
385 dataFrame.Min<MyNumber_t>("myObject"); // OK, "myObject" is deduced to be of type `MyNumber_t`
386 ~~~
387 
388 Deducing types at runtime requires the just-in-time compilation of the relevant actions, which has a small runtime
389 overhead, so specifying the type of the columns as template parameters to the action is good practice when performance is a goal.
390 
391 ### Generic actions
392 `TDataFrame` strives to offer a comprehensive set of standard actions that can be performed on each event. At the same
393 time, it **allows users to execute arbitrary code (i.e. a generic action) inside the event loop** through the `Foreach`
394 and `ForeachSlot` actions.
395 
396 `Foreach(f, columnList)` takes a function `f` (lambda expression, free function, functor...) and a list of columns, and
397 executes `f` on those columns for each event. The function passed must return nothing (i.e. `void`). It can be used to
398 perform actions that are not already available in the interface. For example, the following snippet evaluates the root
399 mean square of column "b":
400 ~~~{.cpp}
401 // Single-thread evaluation of RMS of column "b" using Foreach
402 double sumSq = 0.;
403 unsigned int n = 0;
404 TDataFrame d("bTree", bFilePtr);
405 d.Foreach([&sumSq, &n](double b) { ++n; sumSq += b*b; }, {"b"});
406 std::cout << "rms of b: " << std::sqrt(sumSq / n) << std::endl;
407 ~~~
408 When executing on multiple threads, users are responsible for the thread-safety of the expression passed to `Foreach`.
409 The code above would need to employ some resource protection mechanism to ensure non-concurrent writing of `rms`; but
410 this is probably too much head-scratch for such a simple operation.
411 
412 `ForeachSlot` can help in this situation. It is an alternative version of `Foreach` for which the function takes an
413 additional parameter besides the columns it should be applied to: an `unsigned int slot` parameter, where `slot` is a
414 number indicating which thread (0, 1, 2 , ..., poolSize - 1) the function is being run in. We can take advantage of
415 `ForeachSlot` to evaluate a thread-safe root mean square of branch "b":
416 ~~~{.cpp}
417 // Thread-safe evaluation of RMS of branch "b" using ForeachSlot
418 ROOT::EnableImplicitMT();
419 const unsigned int nSlots = ROOT::GetImplicitMTPoolSize();
420 std::vector<double> sumSqs(nSlots, 0.);
421 std::vector<unsigned int> ns(nSlots, 0);
422 
423 TDataFrame d("bTree", bFilePtr);
424 d.ForeachSlot([&sumSqs, &ns](unsigned int slot, double b) { sumSqs[slot] += b*b; ns[slot] += 1; }, {"b"});
425 double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
426 unsigned int n = std::accumulate(ns.begin(), ns.end(), 0); // sum all counts
427 std::cout << "rms of b: " << std::sqrt(sumSq / n) << std::endl;
428 ~~~
429 You see how we created one `double` variable for each thread in the pool, and later merged their results via
430 `std::accumulate`.
431 
432 ### Friend trees
433 Friend trees are supported by TDataFrame.
434 In order to deal with friend trees with TDataFrame, the user is required to build
435 the tree and its friends and instantiate a TDataFrame with it.
436 Two caveats are presents when using jitted `Define`s and `Filter`s:
437 1) the only columns which can be used in the strings passed to the aforementioned transformations are the top level branches of the friend trees.
438 2) the "friend columns" cannot be written with the notation involving a dot. For example, if a tree is created like this:
439 ~~~{.cpp}
440 TTree t([...]);
441 TTree ft([...]);
442 t.AddFriend(t,"myFriend");
443 ~~~
444 in order to access a certain column `col` of the tree ft, it will be necessary to alias it before. To continue the example:
445 ~~~{.cpp}
446 TDataFrame d(t);
447 d.Alias("myFriend_MyCol", "myFriend.MyCol");
448 auto f = d.Filter("myFriend_MyCol == 42");
449 ~~~
450 
451 ### Reading file formats different from ROOT's
452 TDataFrame can be interfaced with TDataSources. The TDataSource interface defines an API that TDataFrame can use to read arbitrary data formats.
453 
454 A concrete TDataSource implementation (i.e. a class that inherits from TDataSource and implements all of its pure
455 methods) provides an adaptor that TDataFrame can leverage to read any kind of tabular data formats.
456 TDataFrame calls into TDataSource to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns and to advance the readers to the desired data entry.
457 Some predefined TDataSources are natively provided by ROOT such as the `TCsvDS` which allows to read comma separated files:
458 ~~~{.cpp}
459 auto tdf = ROOT::Experimental::TDF::MakeCsvDataFrame("MuRun2010B.csv");
460 auto filteredEvents =
461  tdf.Filter("Q1 * Q2 == -1")
462  .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
463 auto h = filteredEvents.Histo1D("m");
464 h->Draw();
465 ~~~
466 
467 
468 ### <a name="callgraphs"></a>Call graphs (storing and reusing sets of transformations)
469 **Sets of transformations can be stored as variables** and reused multiple times to create **call graphs** in which
470 several paths of filtering/creation of columns are executed simultaneously; we often refer to this as "storing the
471 state of the chain".
472 
473 This feature can be used, for example, to create a temporary column once and use it in several subsequent filters or
474 actions, or to apply a strict filter to the data-set *before* executing several other transformations and actions,
475 effectively reducing the amount of events processed.
476 
477 Let's try to make this clearer with a commented example:
478 ~~~{.cpp}
479 // build the data-frame and specify a default column list
480 TDataFrame d(treeName, filePtr, {"var1", "var2", "var3"});
481 
482 // apply a cut and save the state of the chain
483 auto filtered = d.Filter(myBigCut);
484 
485 // plot branch "var1" at this point of the chain
486 auto h1 = filtered.Histo1D("var1");
487 
488 // create a new branch "vec" with a vector extracted from a complex object (only for filtered entries)
489 // and save the state of the chain
490 auto newBranchFiltered = filtered.Define("vec", [](const Obj& o) { return o.getVector(); }, {"obj"});
491 
492 // apply a cut and fill a histogram with "vec"
493 auto h2 = newBranchFiltered.Filter(cut1).Histo1D("vec");
494 
495 // apply a different cut and fill a new histogram
496 auto h3 = newBranchFiltered.Filter(cut2).Histo1D("vec");
497 
498 // Inspect results
499 h2->Draw(); // first access to an action result: run event-loop!
500 h3->Draw("SAME"); // event loop does not need to be run again here..
501 std::cout << "Entries in h1: " << h1->GetEntries() << std::endl; // ..or here
502 ~~~
503 `TDataFrame` detects when several actions use the same filter or the same temporary column, and **only evaluates each
504 filter or temporary column once per event**, regardless of how many times that result is used down the call graph.
505 Objects read from each column are **built once and never copied**, for maximum efficiency.
506 When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
507 so it might be advisable to put the strictest filters first in the chain.
508 
509 ## <a name="transformations"></a>Transformations
510 ### <a name="Filters"></a> Filters
511 A filter is defined through a call to `Filter(f, columnList)`. `f` can be a function, a lambda expression, a functor
512 class, or any other callable object. It must return a `bool` signalling whether the event has passed the selection
513 (`true`) or not (`false`). It must perform "read-only" actions on the columns, and should not have side-effects (e.g.
514 modification of an external or static variable) to ensure correct results when implicit multi-threading is active.
515 
516 `TDataFrame` only evaluates filters when necessary: if multiple filters are chained one after another, they are executed
517 in order and the first one returning `false` causes the event to be discarded and triggers the processing of the next
518 entry. If multiple actions or transformations depend on the same filter, that filter is not executed multiple times for
519 each entry: after the first access it simply serves a cached result.
520 
521 #### <a name="named-filters-and-cutflow-reports"></a>Named filters and cutflow reports
522 An optional string parameter `name` can be passed to the `Filter` method to create a **named filter**. Named filters
523 work as usual, but also keep track of how many entries they accept and reject.
524 
525 Statistics are retrieved through a call to the `Report` method:
526 
527 - when `Report` is called on the main `TDataFrame` object, it prints stats for all named filters declared up to that
528 point
529 - when called on a specific node (e.g. the result of a `Define` or `Filter`), it prints stats for all named filters in
530 the section of the chain between the main `TDataFrame` and that node (included).
531 
532 Stats are printed in the same order as named filters have been added to the graph, and *refer to the latest event-loop*
533 that has been run using the relevant `TDataFrame`. If `Report` is called before the event-loop has been run at least
534 once, a run is triggered.
535 
536 ### <a name="ranges"></a>Ranges
537 When `TDataFrame` is not being used in a multi-thread environment (i.e. no call to `EnableImplicitMT` was made),
538 `Range` transformations are available. These act very much like filters but instead of basing their decision on
539 a filter expression, they rely on `start`,`stop` and `stride` parameters.
540 
541 - `start`: number of entries that will be skipped before starting processing again
542 - `stop`: maximum number of entries that will be processed
543 - `stride`: only process one entry every `stride` entries
544 
545 The actual number of entries processed downstream of a `Range` node will be `(stop - start)/stride` (or less if less
546 entries than that are available).
547 
548 Note that ranges act "locally", not based on the global entry count: `Range(10,50)` means "skip the first 10 entries
549 *that reach this node*, let the next 40 entries pass, then stop processing". If a range node hangs from a filter node,
550 and the range has a `start` parameter of 10, that means the range will skip the first 10 entries *that pass the
551 preceding filter*.
552 
553 Ranges allow "early quitting": if all branches of execution of a functional graph reached their `stop` value of
554 processed entries, the event-loop is immediately interrupted. This is useful for debugging and quick data explorations.
555 
556 ### <a name="custom-columns"></a> Custom columns
557 Custom columns are created by invoking `Define(name, f, columnList)`. As usual, `f` can be any callable object
558 (function, lambda expression, functor class...); it takes the values of the columns listed in `columnList` (a list of
559 strings) as parameters, in the same order as they are listed in `columnList`. `f` must return the value that will be
560 assigned to the temporary column.
561 
562 A new variable is created called `name`, accessible as if it was contained in the dataset from subsequent
563 transformations/actions.
564 
565 Use cases include:
566 - caching the results of complex calculations for easy and efficient multiple access
567 - extraction of quantities of interest from complex objects
568 - branch aliasing, i.e. changing the name of a branch
569 
570 An exception is thrown if the `name` of the new column/branch is already in use for another branch in the `TTree`.
571 
572 It is also possible to specify the quantity to be stored in the new temporary column as a C++ expression with the method
573 `Define(name, expression)`. For example this invocation
574 
575 ~~~{.cpp}
576 tdf.Define("pt", "sqrt(px*px + py*py)");
577 ~~~
578 
579 will create a new column called "pt" the value of which is calculated starting from the columns px and py. The system
580 builds a just-in-time compiled function starting from the expression after having deduced the list of necessary branches
581 from the names of the variables specified by the user.
582 
583 #### Custom columns as function of slot and entry number
584 
585 It is possible to create custom columns also as a function of the processing slot and entry numbers. The methods that can
586 be invoked are:
587 - `DefineSlot(name, f, columnList)`. In this case the callable f has this signature `R(unsigned int, T1, T2, ...)`: the
588 first parameter is the slot number which ranges from 0 to ROOT::GetImplicitMTPoolSize() - 1.
589 - `DefineSlotEntry(name, f, columnList)`. In this case the callable f has this signature `R(unsigned int, ULong64_t,
590 T1, T2, ...)`: the first parameter is the slot number while the second one the number of the entry being processed.
591 
592 ## <a name="actions"></a>Actions
593 ### Instant and lazy actions
594 Actions can be **instant** or **lazy**. Instant actions are executed as soon as they are called, while lazy actions are
595 executed whenever the object they return is accessed for the first time. As a rule of thumb, actions with a return value
596 are lazy, the others are instant.
597 
598 ### <a name="overview"></a>Overview
599 Here is a quick overview of what actions are present and what they do. Each one is described in more detail in the
600 reference guide.
601 
602 In the following, whenever we say an action "returns" something, we always mean it returns a smart pointer to it. Also
603 note that all actions are only executed for events that pass all preceding filters.
604 
605 | **Lazy actions** | **Description** |
606 |------------------|-----------------|
607 | Count | Return the number of events processed. |
608 | Fill | Fill a user-defined object with the values of the specified branches, as if by calling `Obj.Fill(branch1, branch2, ...). |
609 | Histo{1D,2D,3D} | Fill a {one,two,three}-dimensional histogram with the processed branch values. |
610 | Max | Return the maximum of processed branch values. If the type of the column is inferred, the return type is `double`, the type of the column otherwise.|
611 | Mean | Return the mean of processed branch values. If the type of the column is inferred, the return type is `double`, the type of the column otherwise.|
612 | Min | Return the minimum of processed branch values. If the type of the column is inferred, the return type is `double`, the type of the column otherwise.|
613 | Profile{1D,2D} | Fill a {one,two}-dimensional profile with the branch values that passed all filters. |
614 | Reduce | Reduce (e.g. sum, merge) entries using the function (lambda, functor...) passed as argument. The function must have signature `T(T,T)` where `T` is the type of the branch. Return the final result of the reduction operation. An optional parameter allows initialization of the result object to non-default values. |
615 | Take | Extract a column from the dataset as a collection of values. If the type of the column is a C-style array, the type stored in the return container is a `std::vector<T>` to guarantee the lifetime of the data involved. |
616 
617 | **Instant actions** | **Description** |
618 |---------------------|-----------------|
619 | Foreach | Execute a user-defined function on each entry. Users are responsible for the thread-safety of this lambda when executing with implicit multi-threading enabled. |
620 | ForeachSlot | Same as `Foreach`, but the user-defined function must take an extra `unsigned int slot` as its first parameter. `slot` will take a different value, `0` to `nThreads - 1`, for each thread of execution. This is meant as a helper in writing thread-safe `Foreach` actions when using `TDataFrame` after `ROOT::EnableImplicitMT()`. `ForeachSlot` works just as well with single-thread execution: in that case `slot` will always be `0`. |
621 | Snapshot | Writes processed data-set to disk, in a new `TTree` and `TFile`. Custom columns can be saved as well, filtered entries are not saved. Users can specify which columns to save (default is all). Snapshot, by default, overwrites the output file if it already exists. |
622 | Cache | Caches in contiguous memory columns' entries. Custom columns can be cached as well, filtered entries are not cached. Users can specify which columns to save (default is all). |
623 
624 
625 | **Queries** | **Description** |
626 |-----------|-----------------|
627 | Report | This is not properly an action, since when `Report` is called it does not book an operation to be performed on each entry. Instead, it interrogates the data-frame directly to print a cutflow report, i.e. statistics on how many entries have been accepted and rejected by the filters. See the section on [named filters](#named-filters-and-cutflow-reports) for a more detailed explanation. |
628 
629 ## <a name="parallel-execution"></a>Parallel execution
630 As pointed out before in this document, `TDataFrame` can transparently perform multi-threaded event loops to speed up
631 the execution of its actions. Users only have to call `ROOT::EnableImplicitMT()` *before* constructing the `TDataFrame`
632 object to indicate that it should take advantage of a pool of worker threads. **Each worker thread processes a distinct
633 subset of entries**, and their partial results are merged before returning the final values to the user.
634 
635 ### Thread safety
636 `Filter` and `Define` transformations should be inherently thread-safe: they have no side-effects and are not
637 dependent on global state.
638 Most `Filter`/`Define` functions will in fact be pure in the functional programming sense.
639 All actions are built to be thread-safe with the exception of `Foreach`, in which case users are responsible of
640 thread-safety, see [here](#generic-actions).
641 
642 <a name="reference"></a>
643 */
644 // clang-format on
645 
646 ////////////////////////////////////////////////////////////////////////////
647 /// \brief Build the dataframe
648 /// \param[in] treeName Name of the tree contained in the directory
649 /// \param[in] dirPtr TDirectory where the tree is stored, e.g. a TFile.
650 /// \param[in] defaultBranches Collection of default branches.
651 ///
652 /// The default branches are looked at in case no branch is specified in the
653 /// booking of actions or transformations.
654 /// See TInterface for the documentation of the
655 /// methods available.
656 TDataFrame::TDataFrame(std::string_view treeName, TDirectory *dirPtr, const ColumnNames_t &defaultBranches)
657  : TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(nullptr, defaultBranches))
658 {
659  if (!dirPtr) {
660  auto msg = "Invalid TDirectory!";
661  throw std::runtime_error(msg);
662  }
663  const std::string treeNameInt(treeName);
664  auto tree = static_cast<TTree *>(dirPtr->Get(treeNameInt.c_str()));
665  if (!tree) {
666  auto msg = "Tree \"" + treeNameInt + "\" cannot be found!";
667  throw std::runtime_error(msg);
668  }
669  GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(tree, [](TTree *) {}));
670 }
671 
672 ////////////////////////////////////////////////////////////////////////////
673 /// \brief Build the dataframe
674 /// \param[in] treeName Name of the tree contained in the directory
675 /// \param[in] filenameglob TDirectory where the tree is stored, e.g. a TFile.
676 /// \param[in] defaultBranches Collection of default branches.
677 ///
678 /// The default branches are looked at in case no branch is specified in the
679 /// booking of actions or transformations.
680 /// See TInterface for the documentation of the
681 /// methods available.
682 TDataFrame::TDataFrame(std::string_view treeName, std::string_view filenameglob, const ColumnNames_t &defaultBranches)
683  : TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(nullptr, defaultBranches))
684 {
685  const std::string treeNameInt(treeName);
686  const std::string filenameglobInt(filenameglob);
687  auto chain = std::make_shared<TChain>(treeNameInt.c_str());
688  chain->Add(filenameglobInt.c_str());
689  GetProxiedPtr()->SetTree(chain);
690 }
691 
692 ////////////////////////////////////////////////////////////////////////////
693 /// \brief Build the dataframe
694 /// \param[in] treeName Name of the tree contained in the directory
695 /// \param[in] filenames Collection of file names
696 /// \param[in] defaultBranches Collection of default branches.
697 ///
698 /// The default branches are looked at in case no branch is specified in the booking of actions or transformations.
699 /// See TInterface for the documentation of the methods available.
700 TDataFrame::TDataFrame(std::string_view treeName, const std::vector<std::string> &filenames,
701  const ColumnNames_t &defaultBranches)
702  : TDF::TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(nullptr, defaultBranches))
703 {
704  std::string treeNameInt(treeName);
705  auto chain = std::make_shared<TChain>(treeNameInt.c_str());
706  for (auto &fileName : filenames)
707  chain->Add(TDFInternal::ToConstCharPtr(fileName));
708  GetProxiedPtr()->SetTree(chain);
709 }
710 
711 ////////////////////////////////////////////////////////////////////////////
712 /// \brief Build the dataframe
713 /// \param[in] tree The tree or chain to be studied.
714 /// \param[in] defaultBranches Collection of default column names to fall back to when none is specified.
715 ///
716 /// The default branches are looked at in case no branch is specified in the
717 /// booking of actions or transformations.
718 /// See TInterface for the documentation of the
719 /// methods available.
720 TDataFrame::TDataFrame(TTree &tree, const ColumnNames_t &defaultBranches)
721  : TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(&tree, defaultBranches))
722 {
723 }
724 
725 //////////////////////////////////////////////////////////////////////////
726 /// \brief Build a dataframe that generates numEntries entries.
727 /// \param[in] numEntries The number of entries to generate.
728 ///
729 /// An empty-source dataframe constructed with a number of entries will
730 /// generate those entries on the fly when some action is triggered,
731 /// and it will do so for all the previously-defined temporary branches.
733  : TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(numEntries))
734 {
735 }
736 
737 //////////////////////////////////////////////////////////////////////////
738 /// \brief Build dataframe associated to datasource.
739 /// \param[in] ds The data-source object.
740 /// \param[in] defaultBranches Collection of default column names to fall back to when none is specified.
741 ///
742 /// A dataframe associated to a datasource will query it to access column values.
743 TDataFrame::TDataFrame(std::unique_ptr<TDataSource> ds, const ColumnNames_t &defaultBranches)
744  : TInterface<TDFDetail::TLoopManager>(std::make_shared<TDFDetail::TLoopManager>(std::move(ds), defaultBranches))
745 {
746 }
TDataFrame(std::string_view treeName, std::string_view filenameglob, const ColumnNames_t &defaultBranches={})
Build the dataframe.
Definition: TDataFrame.cxx:682
basic_string_view< char > string_view
Definition: RStringView.h:35
const std::shared_ptr< TDFDetail::TLoopManager > & GetProxiedPtr() const
TDFDetail::ColumnNames_t ColumnNames_t
Definition: TDataFrame.hxx:40
STL namespace.
const char * ToConstCharPtr(const char *s)
Definition: TDFUtils.cxx:212
unsigned long long ULong64_t
Definition: RtypesCore.h:70
Key/value store of objects.
Definition: TDirectory.hxx:70
Definition: tree.py:1
std::shared_ptr< ToContentType_t< T > > Get(std::string_view name)
Get the object for a key.
Definition: TDirectory.hxx:153