#include <nlohmann/json.hpp>
   void Exec(unsigned int slot)
   {
      fPerThreadResults[slot]++;
   }

   // Called at the end of the event loop.
   void Finalize()
   {
      *fFinalResult = std::accumulate(fPerThreadResults.begin(), fPerThreadResults.end(), 0);
   }

   // Called by RDataFrame to retrieve the name of this action.
   std::string GetActionName() const { return "MyCounter"; }
};

ROOT::RDataFrame df(10);
ROOT::RDF::RResultPtr<int> resultPtr = df.Book<>(MyCounter{df.GetNSlots()}, {});
// The GetValue call triggers the event loop
std::cout << "Number of processed entries: " << resultPtr.GetValue() << std::endl;
See the Book() method for more information and [this tutorial](https://root.cern/doc/master/df018__customActions_8C.html)
for a more complete example.

#### Injecting arbitrary code in the event loop with Foreach() and ForeachSlot()

Foreach() takes a callable (lambda expression, free function, functor...) and a list of columns and
executes the callable on the values of those columns for each event that passes all upstream selections.
It can be used to perform actions that are not already available in the interface. For example, the following snippet
evaluates the root mean square of column "x":
// Single-thread evaluation of RMS of column "x" using Foreach
double sumSq = 0.;
unsigned int n = 0;
df.Foreach([&sumSq, &n](double x) { ++n; sumSq += x*x; }, {"x"});
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
In multi-thread runs, users are responsible for the thread-safety of the expression passed to Foreach():
several threads will execute the expression concurrently.
The code above would need some resource protection mechanism to ensure non-concurrent writing of `sumSq` and `n`, but
this is probably more trouble than such a simple operation deserves.
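For illustration, the kind of resource protection mentioned above could be a mutex around the shared accumulators. The sketch below is standalone standard C++, not RDataFrame code: `SharedSums`, `Add` and `FillConcurrently` are hypothetical names introduced here, and plain `std::thread` workers stand in for the threads that would execute the Foreach() callable.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Accumulators shared by all threads, guarded by a single mutex (hypothetical helper type).
struct SharedSums {
   std::mutex mtx;
   double sumSq = 0.;
   unsigned int n = 0;

   // Body of what would be the Foreach() callable: lock, then update both accumulators.
   void Add(double x)
   {
      std::lock_guard<std::mutex> lock(mtx);
      ++n;
      sumSq += x * x;
   }
};

// Fill `sums` from `nThreads` workers; worker t processes entries t, t + nThreads, ...
void FillConcurrently(SharedSums &sums, const std::vector<double> &values, unsigned int nThreads)
{
   std::vector<std::thread> workers;
   for (unsigned int t = 0; t < nThreads; ++t) {
      workers.emplace_back([&sums, &values, t, nThreads] {
         for (std::size_t i = t; i < values.size(); i += nThreads)
            sums.Add(values[i]);
      });
   }
   for (auto &w : workers)
      w.join();
}
```

With the values 1, 2, 3, 4 and two workers, `sums.n` ends up at 4 and `sums.sumSq` at 30. Every entry pays for a lock acquisition, which is the overhead the text above alludes to.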
ForeachSlot() can help in this situation. It is an alternative version of Foreach() for which the function takes an
additional "processing slot" parameter besides the columns it should be applied to. RDataFrame
guarantees that ForeachSlot() will invoke the user expression with different `slot` parameters for different concurrent
executions (see [Special helper columns: rdfentry_ and rdfslot_](\ref helper-cols) for more information on the slot parameter).
We can take advantage of ForeachSlot() to evaluate a thread-safe root mean square of column "x":
// Thread-safe evaluation of RMS of column "x" using ForeachSlot
ROOT::EnableImplicitMT();
const unsigned int nSlots = df.GetNSlots();
std::vector<double> sumSqs(nSlots, 0.);
std::vector<unsigned int> ns(nSlots, 0);

df.ForeachSlot([&sumSqs, &ns](unsigned int slot, double x) { sumSqs[slot] += x*x; ns[slot] += 1; }, {"x"});
double sumSq = std::accumulate(sumSqs.begin(), sumSqs.end(), 0.); // sum all squares
unsigned int n = std::accumulate(ns.begin(), ns.end(), 0u); // sum all counts
std::cout << "rms of x: " << std::sqrt(sumSq / n) << std::endl;
Notice how we created one `double` accumulator and one counter for each processing slot and later merged their results via `std::accumulate`.
Friend TTrees are supported by RDataFrame.
Friend TTrees with a TTreeIndex are supported starting from ROOT v6.24.

To use friend trees in RDataFrame, it is necessary to add the friends directly to
the tree and instantiate an RDataFrame with the main tree:

// `t` is the main TTree and `ft` the friend TTree
t.AddFriend(&ft, "myFriend");

ROOT::RDataFrame d(t);
auto f = d.Filter("myFriend.MyCol == 42");

Columns coming from the friend trees can be referred to by their full name, like in the example above,
or the friend tree name can be omitted in case the column name is not ambiguous (e.g. "MyCol" could be used instead of
"myFriend.MyCol" in the example above).
\anchor other-file-formats
### Reading data formats other than ROOT trees
RDataFrame can be interfaced with RDataSources. The ROOT::RDF::RDataSource interface defines an API that RDataFrame can use to read arbitrary columnar data formats.

RDataFrame calls into concrete RDataSource implementations to retrieve information about the data, retrieve (thread-local) readers or "cursors" for selected columns
and to advance the readers to the desired data entry.
Some predefined RDataSources are natively provided by ROOT, such as ROOT::RDF::RCsvDS, which reads comma-separated files:
auto tdf = ROOT::RDF::FromCSV("MuRun2010B.csv");
auto filteredEvents =
   tdf.Filter("Q1 * Q2 == -1")
      .Define("m", "sqrt(pow(E1 + E2, 2) - (pow(px1 + px2, 2) + pow(py1 + py2, 2) + pow(pz1 + pz2, 2)))");
auto h = filteredEvents.Histo1D("m");

See also FromNumpy (Python-only), FromRNTuple(), FromArrow(), FromSqlite().
### Computation graphs (storing and reusing sets of transformations)

As we saw, transformed dataframes can be stored as variables and reused multiple times to create modified versions of the dataset. This implicitly defines a **computation graph** in which
several paths of filtering/creation of columns are executed simultaneously, and finally aggregated results are produced.

RDataFrame detects when several actions use the same filter or the same defined column, and **only evaluates each
filter or defined column once per event**, regardless of how many times that result is used down the computation graph.
Objects read from each column are **built once and never copied**, for maximum efficiency.
When "upstream" filters are not passed, subsequent filters, temporary column expressions and actions are not evaluated,
so it might be advisable to put the strictest filters first in the graph.
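The effect of filter ordering can be illustrated with a small standalone sketch in plain C++ (no RDataFrame involved). `CountEvaluations` is a hypothetical helper introduced here; it only mimics the short-circuiting described above, where an entry rejected by one filter is never passed to the filters downstream, and counts how often each filter runs.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Run every entry through `filters` in order, stopping at the first filter that rejects it,
// and return how many times each filter was evaluated.
std::vector<std::size_t> CountEvaluations(const std::vector<int> &entries,
                                          const std::vector<std::function<bool(int)>> &filters)
{
   std::vector<std::size_t> counts(filters.size(), 0);
   for (int entry : entries) {
      for (std::size_t i = 0; i < filters.size(); ++i) {
         ++counts[i];
         if (!filters[i](entry))
            break; // entry rejected: downstream filters are not evaluated
      }
   }
   return counts;
}
```

For entries 0 to 99, a strict filter `x < 10` followed by a looser filter `x % 2 == 0` evaluates the second filter 10 times. With the order swapped, the second filter is evaluated 50 times, although the same entries pass both chains.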
\anchor representgraph
### Visualizing the computation graph
It is possible to print the computation graph from any node to obtain a [DOT (graphviz)](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) representation either on the standard output
or in a file.

Invoking ROOT::RDF::SaveGraph() on any node that is not the head node prints the computation graph of the branch
the node belongs to. Invoking it on the head node prints the entire computation graph.

The following is an example of usage:
// First, a sample computational graph is built
ROOT::RDataFrame df("tree", "f.root");

auto df2 = df.Define("x", []() { return 1; })
             .Filter("col0 % 1 == col0")
             .Filter([](int b1) { return b1 < 2; }, {"cut1"})
             .Define("y", []() { return 1; });

auto count = df2.Count();

// Prints the graph to the mydot.dot file in the current directory
ROOT::RDF::SaveGraph(df, "./mydot.dot");
// Prints the graph to standard output
ROOT::RDF::SaveGraph(df);
The generated graph can be rendered using one of the graphviz filters, e.g. `dot`. For instance, the image below can be generated with the following command:

$ dot -Tpng computation_graph.dot -ocomputation_graph.png

\image html RDF_Graph2.png
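For readers unfamiliar with the DOT language, the sketch below builds a DOT description of a simple linear chain of nodes in plain C++. `ChainToDot` is a hypothetical helper introduced only for illustration. Its output shows the general shape of a DOT file and is not necessarily the text that ROOT::RDF::SaveGraph() produces.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Build a DOT digraph for a linear chain: labels[0] -> labels[1] -> ...
// Labels are assumed not to contain double quotes (no escaping is performed).
std::string ChainToDot(const std::vector<std::string> &labels)
{
   std::ostringstream dot;
   dot << "digraph {\n";
   // One node per label, identified by its index.
   for (std::size_t i = 0; i < labels.size(); ++i)
      dot << "\t" << i << " [label=\"" << labels[i] << "\"];\n";
   // One edge between each pair of consecutive nodes.
   for (std::size_t i = 0; i + 1 < labels.size(); ++i)
      dot << "\t" << i << " -> " << i + 1 << ";\n";
   dot << "}\n";
   return dot.str();
}
```

Writing the returned string to a file, say `chain.dot` (an assumed name), allows rendering it with the same kind of `dot -Tpng` command shown above.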
### Activating RDataFrame execution logs

RDataFrame has experimental support for verbose logging of the event loop runtimes and other interesting related information. It is activated as follows:

#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::Experimental::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::Experimental::ELogLevel::kInfo);

or, in Python:

import ROOT

verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kInfo)

More information (e.g. start and end of each multi-thread task) is printed using `ELogLevel.kDebug` and even more
(e.g. a full dump of the generated code that RDataFrame just-in-time-compiles) using `ELogLevel.kDebug+10`.
   : RInterface(std::make_shared<RDFDetail::RLoopManager>(nullptr, defaultColumns))

      auto msg = "Invalid TDirectory!";
      throw std::runtime_error(msg);

   const std::string treeNameInt(treeName);
   auto tree = static_cast<TTree *>(dirPtr->Get(treeNameInt.c_str()));

      auto msg = "Tree \"" + treeNameInt + "\" cannot be found!";
      throw std::runtime_error(msg);

   GetProxiedPtr()->SetTree(std::shared_ptr<TTree>(tree, [](TTree *) {}));

RDataFrame::RDataFrame(std::string_view treeName, std::string_view filenameglob, const ColumnNames_t &defaultColumns)

   const std::string treeNameInt(treeName);
   const std::string filenameglobInt(filenameglob);

   chain->Add(filenameglobInt.c_str());

   std::string treeNameInt(treeName);

   for (auto &f : fileglobs)
      chain->Add(f.c_str());
namespace Experimental {

   const nlohmann::json fullData = nlohmann::json::parse(std::ifstream(jsonFile));
   if (!fullData.contains("samples") || fullData["samples"].size() == 0) {
      throw std::runtime_error(
         R"(The input specification does not contain any samples. Please provide the samples in the specification like:

         "trees": ["tree1", "tree2"],
         "files": ["file1.root", "file2.root"],
         "metadata": {"lumi": 1.0, }

         "trees": ["tree3", "tree4"],
         "files": ["file3.root", "file4.root"],
         "metadata": {"lumi": 0.5, }

   for (const auto &keyValue : fullData["samples"].items()) {
      const std::string &sampleName = keyValue.key();
      const auto &sample = keyValue.value();

      if (!sample.contains("trees")) {
         throw std::runtime_error("A list of tree names must be provided for sample " + sampleName + ".");

      std::vector<std::string> trees = sample["trees"];
      if (!sample.contains("files")) {
         throw std::runtime_error("A list of files must be provided for sample " + sampleName + ".");

      std::vector<std::string> files = sample["files"];
      if (!sample.contains("metadata")) {

         for (const auto &metadata : sample["metadata"].items()) {
            const auto &val = metadata.value();
            if (val.is_string())
               m.Add(metadata.key(), val.get<std::string>());
            else if (val.is_number_integer())
               m.Add(metadata.key(), val.get<int>());
            else if (val.is_number_float())
               m.Add(metadata.key(), val.get<double>());

               throw std::logic_error("The metadata keys can only be of type [string|int|double].");

   if (fullData.contains("friends")) {
      for (const auto &friends : fullData["friends"].items()) {
         std::string alias = friends.key();
         std::vector<std::string> trees = friends.value()["trees"];
         std::vector<std::string> files = friends.value()["files"];
         if (files.size() != trees.size() && trees.size() > 1)
            throw std::runtime_error("Mismatch between trees and files in a friend.");

   if (fullData.contains("range")) {
      std::vector<int> range = fullData["range"];

      if (range.size() == 1)

      else if (range.size() == 2)
   auto *tree = df.GetTree();
   auto defCols = df.GetDefaultColumnNames();

   std::ostringstream ret;

      ret << "A data frame built on top of the " << tree->GetName() << " dataset.";
      if (!defCols.empty()) {
         if (defCols.size() == 1)
            ret << "\nDefault column: " << defCols[0];

            ret << "\nDefault columns:\n";
            for (auto &&col : defCols) {
               ret << " - " << col << "\n";

      ret << "A data frame associated to the data source \"" << cling::printValue(ds) << "\"";

      ret << "An empty data frame that will create " << df.GetNEmptyEntries() << " entries\n";