df001_introduction.py
## \file
## \ingroup tutorial_dataframe
## \notebook -nodraw
## Basic usage of RDataFrame from Python.
##
## This tutorial illustrates the basic features of the RDataFrame class,
## a utility that lets you interact with data stored in TTrees following
## a functional-chain-like approach.
##
## \macro_code
## \macro_output
##
## \date May 2017
## \author Danilo Piparo (CERN)

import ROOT

# A simple helper function to fill a test tree: this makes the example stand-alone.
def fill_tree(treeName, fileName):
    df = ROOT.RDataFrame(10)
    df.Define("b1", "(double) rdfentry_")\
      .Define("b2", "(int) rdfentry_ * rdfentry_").Snapshot(treeName, fileName)

# We prepare an input tree to run on
fileName = "df001_introduction_py.root"
treeName = "myTree"
fill_tree(treeName, fileName)


# We read the tree from the file and create an RDataFrame, a class that
# allows us to interact with the data contained in the tree.
d = ROOT.RDataFrame(treeName, fileName)

# Operations on the dataframe
# We now review some *actions* which can be performed on the data frame.
# All actions but Foreach return an RResultPtr<T>. The series of
# operations on the data frame is not executed until one of those pointers
# is accessed.
# But first, let us define our cut-flow with two strings.
# Filters can be expressed as strings. The content must be C++ code, and the
# variable names must match the branch names. The code is just-in-time
# compiled.
cutb1 = 'b1 < 5.'
cutb1b2 = 'b2 % 2 && b1 < 4.'

# `Count` action
# The `Count` action retrieves the number of entries that passed the
# filters. Because the filters are given as strings, the columns they use
# are inferred from the expressions and need not be listed explicitly.
entries1 = d.Filter(cutb1) \
            .Filter(cutb1b2) \
            .Count()

print("%s entries passed all filters" % entries1.GetValue())

entries2 = d.Filter("b1 < 5.").Count()
print("%s entries passed all filters" % entries2.GetValue())
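# A plain-Python cross-check of the two counts above. This is a minimal sketch
# that assumes the tree was written by fill_tree, so that entry i (0..9) holds
# b1 = i and b2 = i * i. The check_* names are illustrative, not ROOT API.
check_entries = [(float(i), i * i) for i in range(10)]
# Same selection as cutb1 followed by cutb1b2: b1 < 5, b2 odd and b1 < 4.
check_count1 = sum(1 for b1, b2 in check_entries if b1 < 5. and b2 % 2 and b1 < 4.)
# Same selection as the single filter "b1 < 5."
check_count2 = sum(1 for b1, b2 in check_entries if b1 < 5.)
print("Expected counts: %s and %s" % (check_count1, check_count2))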

# `Min`, `Max` and `Mean` actions
# These actions retrieve statistical information about the entries
# passing the cuts, if any.
b1b2_cut = d.Filter(cutb1b2)
minVal = b1b2_cut.Min('b1')
maxVal = b1b2_cut.Max('b1')
meanVal = b1b2_cut.Mean('b1')
nonDefmeanVal = b1b2_cut.Mean("b2")
print("The mean always lies between the min and the max: %s <= %s <= %s" % (minVal.GetValue(), meanVal.GetValue(), maxVal.GetValue()))
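# The same statistics worked out in plain Python. This is a sketch assuming
# entry i (0..9) holds b1 = i and b2 = i * i, as written by fill_tree. The
# check_* names are illustrative only.
# Keep b1 for the entries passing cutb1b2 (b2 odd and b1 < 4).
check_b1 = [float(i) for i in range(10) if (i * i) % 2 and i < 4]
check_min, check_max = min(check_b1), max(check_b1)
check_mean = sum(check_b1) / len(check_b1)
print("Expected min, mean, max of b1: %s, %s, %s" % (check_min, check_mean, check_max))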

# `Histo1D` action
# The `Histo1D` action fills a histogram. It returns a TH1D filled
# with the values of the column that passed the filters. For the most common
# types, the type of the values stored in the column is automatically
# guessed.
hist = d.Filter(cutb1).Histo1D('b1')
print("Filled h %s times, mean: %s" % (hist.GetEntries(), hist.GetMean()))

# Express your chain of operations with clarity!
# We are discussing a simple example here, but it is not hard to imagine much
# more complex pipelines of actions acting on data. Those might require
# well-organised code, for example to add filters conditionally or to
# separate filters and actions clearly without having to write the entire
# pipeline on one line. This can be easily achieved.
# We'll show this by reworking the `Count` example:
cutb1_result = d.Filter(cutb1)
cutb1b2_result = d.Filter(cutb1b2)
cutb1_cutb1b2_result = cutb1_result.Filter(cutb1b2)

# Now we want to count:
evts_cutb1_result = cutb1_result.Count()
evts_cutb1b2_result = cutb1b2_result.Count()
evts_cutb1_cutb1b2_result = cutb1_cutb1b2_result.Count()

print("Events passing cutb1: %s" % evts_cutb1_result.GetValue())
print("Events passing cutb1b2: %s" % evts_cutb1b2_result.GetValue())
print("Events passing both: %s" % evts_cutb1_cutb1b2_result.GetValue())
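
# One possible way to add filters conditionally, as mentioned above. This is a
# sketch only: apply_cuts is a hypothetical helper, not part of ROOT, and it
# assumes nothing beyond the node exposing a Filter(cut) method that returns
# a new node, as the data frame `d` does.
def apply_cuts(node, cuts):
    """Chain one Filter call per cut string and return the resulting node."""
    for cut in cuts:
        node = node.Filter(cut)
    return node

# Hypothetical usage with the data frame and cuts defined above:
#   selected_cuts = [cutb1]
#   if use_tight_selection:      # illustrative flag, not part of the tutorial
#       selected_cuts.append(cutb1b2)
#   n_selected = apply_cuts(d, selected_cuts).Count()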

# Calculating quantities starting from existing columns
# Often, operations need to be carried out on quantities calculated starting
# from the ones present in the columns. In this example we create a third
# column whose values are the sum of the *b1* and *b2* ones, entry by
# entry. The new quantity is defined via a string holding a C++ expression.
# It is important to note three aspects at this point:
# - The value is created on the fly, and only if the entry passed the
#   existing filters.
# - The newly created column behaves like the ones present in the file on disk.
# - The operation creates a new value, without modifying anything. De facto,
#   this is like having a general container at disposal able to accommodate
#   any value of any type.
# Let's dive into an example:
entries_sum = d.Define('sum', 'b2 + b1') \
               .Filter('sum > 4.2') \
               .Count()
print(entries_sum.GetValue())
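# Plain-Python cross-check of the count above. This is a sketch assuming entry
# i (0..9) holds b1 = i and b2 = i * i, as written by fill_tree. The check_*
# names are illustrative only.
check_sums = [i * i + float(i) for i in range(10)]
check_count_sum = sum(1 for s in check_sums if s > 4.2)
print("Expected entries with sum > 4.2: %s" % check_count_sum)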