{ "cells": [ { "cell_type": "markdown", "id": "da29a309", "metadata": {}, "source": [ "# rf402_datahandling\n", "Data and categories: tools for manipulation of (un)binned datasets\n", "\n", "\n", "\n", "\n", "**Author:** Wouter Verkerke \n", "This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Tuesday, May 19, 2026 at 08:31 PM." ] }, { "cell_type": "code", "execution_count": 1, "id": "6a290aef", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:32.588775Z", "iopub.status.busy": "2026-05-19T20:31:32.588667Z", "iopub.status.idle": "2026-05-19T20:31:32.602822Z", "shell.execute_reply": "2026-05-19T20:31:32.602330Z" } }, "outputs": [], "source": [ "%%cpp -d\n", "#include \"RooRealVar.h\"\n", "#include \"RooDataSet.h\"\n", "#include \"RooDataHist.h\"\n", "#include \"RooGaussian.h\"\n", "#include \"RooCategory.h\"\n", "#include \"TCanvas.h\"\n", "#include \"TAxis.h\"\n", "#include \"RooPlot.h\"\n", "#include \"TFile.h\"\n", "using namespace RooFit;" ] }, { "cell_type": "markdown", "id": "ba608f2e", "metadata": {}, "source": [ "Binned (RooDataHist) and unbinned datasets (RooDataSet) share\n", "many properties and inherit from a common abstract base class\n", "(RooAbsData), that provides an interface for all operations\n", "that can be performed regardless of the data format" ] }, { "cell_type": "code", "execution_count": 2, "id": "11078b4c", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:32.604448Z", "iopub.status.busy": "2026-05-19T20:31:32.604336Z", "iopub.status.idle": "2026-05-19T20:31:32.954306Z", "shell.execute_reply": "2026-05-19T20:31:32.953614Z" } }, "outputs": [], "source": [ "RooRealVar x(\"x\", \"x\", -10, 10);\n", "RooRealVar y(\"y\", \"y\", 0, 40);\n", "RooCategory c(\"c\", \"c\");\n", "c.defineType(\"Plus\", +1);\n", "c.defineType(\"Minus\", -1);" ] }, { "cell_type": "markdown", "id": "80ccb6df", "metadata": {}, 
"source": [ "Basic Operations on unbinned datasets\n", "--------------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "fdb8fd6b", "metadata": {}, "source": [ "RooDataSet is an unbinned dataset (a collection of points in N-dimensional space)" ] }, { "cell_type": "code", "execution_count": 3, "id": "e19993e3", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:32.955986Z", "iopub.status.busy": "2026-05-19T20:31:32.955873Z", "iopub.status.idle": "2026-05-19T20:31:33.163705Z", "shell.execute_reply": "2026-05-19T20:31:33.163274Z" } }, "outputs": [], "source": [ "RooDataSet d(\"d\", \"d\", RooArgSet(x, y, c));" ] }, { "cell_type": "markdown", "id": "f02e0ab4", "metadata": {}, "source": [ "Unlike RooAbsArgs (RooAbsPdf,RooFormulaVar,....) datasets are not attached to\n", "the variables they are constructed from. Instead they are attached to an internal\n", "clone of the supplied set of arguments" ] }, { "cell_type": "markdown", "id": "6d43d4d3", "metadata": {}, "source": [ "Fill d with dummy values" ] }, { "cell_type": "code", "execution_count": 4, "id": "f150f12a", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:33.176279Z", "iopub.status.busy": "2026-05-19T20:31:33.176127Z", "iopub.status.idle": "2026-05-19T20:31:33.386113Z", "shell.execute_reply": "2026-05-19T20:31:33.385630Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataStore d (d)\n", " Contains 1000 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) y = 31.607 L(0 - 40) \"y\"\n", " 3) c = Plus(idx = 1)\n", " \"c\"\n", "\n" ] } ], "source": [ "Int_t i;\n", "for (i = 0; i < 1000; i++) {\n", " x = i / 50 - 10;\n", " y = sqrt(1.0 * i);\n", " c.setLabel((i % 2) ? 
\"Plus\" : \"Minus\");\n", "\n", " // We must explicitly refer to x,y,c here to pass the values because\n", " // d is not linked to them (as explained above)\n", " d.add(RooArgSet(x, y, c));\n", "}\n", "d.Print(\"v\");\n", "cout << endl;" ] }, { "cell_type": "markdown", "id": "01c79eb5", "metadata": {}, "source": [ "The get() function returns a pointer to the internal copy of the RooArgSet(x,y,c)\n", "supplied in the constructor" ] }, { "cell_type": "code", "execution_count": 5, "id": "ff1f857b", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:33.387635Z", "iopub.status.busy": "2026-05-19T20:31:33.387508Z", "iopub.status.idle": "2026-05-19T20:31:33.595751Z", "shell.execute_reply": "2026-05-19T20:31:33.595245Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1) 0x7f2531092ce0 RooRealVar:: x = 9 L(-10 - 10) \"x\"\n", " 2) 0x7f25311738c0 RooRealVar:: y = 31.607 L(0 - 40) \"y\"\n", " 3) 0x7f25310d8870 RooCategory:: c = Plus(idx = 1)\n", " \"c\"\n", "\n" ] } ], "source": [ "const RooArgSet *row = d.get();\n", "row->Print(\"v\");\n", "cout << endl;" ] }, { "cell_type": "markdown", "id": "03a3e2c4", "metadata": {}, "source": [ "Get with an argument loads a specific data point in row and returns\n", "a pointer to row argset. get() always returns the same pointer, unless\n", "an invalid row number is specified. 
In that case a null ptr is returned" ] }, { "cell_type": "code", "execution_count": 6, "id": "cef015a2", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:33.597262Z", "iopub.status.busy": "2026-05-19T20:31:33.597152Z", "iopub.status.idle": "2026-05-19T20:31:33.805645Z", "shell.execute_reply": "2026-05-19T20:31:33.804861Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1) 0x7f2531092ce0 RooRealVar:: x = 8 L(-10 - 10) \"x\"\n", " 2) 0x7f25311738c0 RooRealVar:: y = 30 L(0 - 40) \"y\"\n", " 3) 0x7f25310d8870 RooCategory:: c = Minus(idx = -1)\n", " \"c\"\n", "\n" ] } ], "source": [ "d.get(900)->Print(\"v\");\n", "cout << endl;" ] }, { "cell_type": "markdown", "id": "b926964c", "metadata": {}, "source": [ "Reducing, Appending and Merging\n", "-------------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "1a6ca657", "metadata": {}, "source": [ "The reduce() function returns a new dataset which is a subset of the original" ] }, { "cell_type": "code", "execution_count": 7, "id": "59098c38", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:33.807050Z", "iopub.status.busy": "2026-05-19T20:31:33.806940Z", "iopub.status.idle": "2026-05-19T20:31:34.013395Z", "shell.execute_reply": "2026-05-19T20:31:34.012986Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", ">> d1 has only columns x,c\n", "DataStore d (d)\n", " Contains 1000 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) c = Plus(idx = 1)\n", " \"c\"\n", "\n", ">> d2 has only column y\n", "DataStore d (d)\n", " Contains 1000 entries\n", " Observables: \n", " 1) y = 31.607 L(0 - 40) \"y\"\n", "\n", ">> d3 has only the points with y>5.17\n", "DataStore d (d)\n", " Contains 973 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) y = 31.607 L(0 - 40) \"y\"\n", " 3) c = Plus(idx = 1)\n", " \"c\"\n", "\n", ">> d4 has 
only columns x,c for data points with y>5.17\n", "DataStore d (d)\n", " Contains 973 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) c = Plus(idx = 1)\n", " \"c\"\n" ] } ], "source": [ "cout << endl << \">> d1 has only columns x,c\" << endl;\n", "std::unique_ptr<RooAbsData> d1{d.reduce({x, c})};\n", "d1->Print(\"v\");\n", "\n", "cout << endl << \">> d2 has only column y\" << endl;\n", "std::unique_ptr<RooAbsData> d2{d.reduce({y})};\n", "d2->Print(\"v\");\n", "\n", "cout << endl << \">> d3 has only the points with y>5.17\" << endl;\n", "std::unique_ptr<RooAbsData> d3{d.reduce(\"y>5.17\")};\n", "d3->Print(\"v\");\n", "\n", "cout << endl << \">> d4 has only columns x,c for data points with y>5.17\" << endl;\n", "std::unique_ptr<RooAbsData> d4{d.reduce({x, c}, \"y>5.17\")};\n", "d4->Print(\"v\");" ] }, { "cell_type": "markdown", "id": "c13be878", "metadata": {}, "source": [ "The merge() function combines two datasets column-wise" ] }, { "cell_type": "code", "execution_count": 8, "id": "1141cadb", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:34.014961Z", "iopub.status.busy": "2026-05-19T20:31:34.014849Z", "iopub.status.idle": "2026-05-19T20:31:34.220409Z", "shell.execute_reply": "2026-05-19T20:31:34.219950Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", ">> merge d2(y) with d1(x,c) to form d1(x,c,y)\n", "DataStore d (d)\n", " Contains 1000 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) c = Plus(idx = 1)\n", " \"c\"\n", " 3) y = 31.607 L(0 - 40) \"y\"\n" ] } ], "source": [ "cout << endl << \">> merge d2(y) with d1(x,c) to form d1(x,c,y)\" << endl;\n", "static_cast<RooDataSet &>(*d1).merge(&static_cast<RooDataSet &>(*d2));\n", "d1->Print(\"v\");" ] }, { "cell_type": "markdown", "id": "15b07c56", "metadata": {}, "source": [ "The append() function adds two datasets row-wise" ] }, { "cell_type": "code", "execution_count": 9, "id": "dd0d7a78", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": 
"2026-05-19T20:31:34.222003Z", "iopub.status.busy": "2026-05-19T20:31:34.221892Z", "iopub.status.idle": "2026-05-19T20:31:34.427642Z", "shell.execute_reply": "2026-05-19T20:31:34.427087Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", ">> append data points of d3 to d1\n", "DataStore d (d)\n", " Contains 1973 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) \"x\"\n", " 2) c = Plus(idx = 1)\n", " \"c\"\n", " 3) y = 31.607 L(0 - 40) \"y\"\n" ] } ], "source": [ "cout << endl << \">> append data points of d3 to d1\" << endl;\n", "static_cast<RooDataSet &>(*d1).append(static_cast<RooDataSet &>(*d3));\n", "d1->Print(\"v\");" ] }, { "cell_type": "markdown", "id": "71343637", "metadata": {}, "source": [ "Operations on binned datasets\n", "---------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "7560a700", "metadata": {}, "source": [ "A binned dataset can be constructed empty, from an unbinned dataset, or\n", "from a ROOT native histogram (TH1,2,3)" ] }, { "cell_type": "code", "execution_count": 10, "id": "34a58772", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:34.429174Z", "iopub.status.busy": "2026-05-19T20:31:34.429058Z", "iopub.status.idle": "2026-05-19T20:31:34.637557Z", "shell.execute_reply": "2026-05-19T20:31:34.637027Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> construct dh (binned) from d(unbinned) but only take the x and y dimensions,\n", ">> the category 'c' will be projected in the filling process\n" ] } ], "source": [ "cout << \">> construct dh (binned) from d(unbinned) but only take the x and y dimensions,\" << endl\n", " << \">> the category 'c' will be projected in the filling process\" << endl;" ] }, { "cell_type": "markdown", "id": "87218d30", "metadata": {}, "source": [ "The binning of real variables (like x,y) is done using their fit range\n", "'get/setRange()' and number of specified fit bins 'get/setBins()'.\n", "Category 
dimensions of binned datasets get one bin per defined category state" ] }, { "cell_type": "code", "execution_count": 11, "id": "0a135acc", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:34.639053Z", "iopub.status.busy": "2026-05-19T20:31:34.638943Z", "iopub.status.idle": "2026-05-19T20:31:34.844929Z", "shell.execute_reply": "2026-05-19T20:31:34.844386Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataStore dh (binned version of d)\n", " Contains 100 entries\n", " Observables: \n", " 1) x = 9 L(-10 - 10) B(10) \"x\"\n", " 2) y = 38 L(0 - 40) B(10) \"y\"\n", "Binned Dataset dh (binned version of d)\n", " Contains 100 bins with a total weight of 1000\n", " Observables: 1) x = 9 L(-10 - 10) B(10) \"x\"\n", " 2) y = 38 L(0 - 40) B(10) \"y\"\n" ] } ], "source": [ "x.setBins(10);\n", "y.setBins(10);\n", "RooDataHist dh(\"dh\", \"binned version of d\", RooArgSet(x, y), d);\n", "dh.Print(\"v\");\n", "\n", "RooPlot *yframe = y.frame(Bins(10), Title(\"Operations on binned datasets\"));\n", "dh.plotOn(yframe); // plot projection of 2D binned data on y" ] }, { "cell_type": "markdown", "id": "af286b6b", "metadata": {}, "source": [ "Examine the statistics of a binned dataset" ] }, { "cell_type": "code", "execution_count": 12, "id": "ccde58fd", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:34.846673Z", "iopub.status.busy": "2026-05-19T20:31:34.846544Z", "iopub.status.idle": "2026-05-19T20:31:35.052636Z", "shell.execute_reply": "2026-05-19T20:31:35.051674Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> number of bins in dh : 100\n", ">> sum of weights in dh : 1000\n", ">> integral over histogram: 8000\n" ] } ], "source": [ "cout << \">> number of bins in dh : \" << dh.numEntries() << endl;\n", "cout << \">> sum of weights in dh : \" << dh.sum(false) << endl;\n", "cout << \">> integral over histogram: \" << dh.sum(true) << endl; // 
accounts for bin volume" ] }, { "cell_type": "markdown", "id": "ad9a1e5e", "metadata": {}, "source": [ "Locate a bin from a set of coordinates and retrieve its properties" ] }, { "cell_type": "code", "execution_count": 13, "id": "c18e7851", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:35.054136Z", "iopub.status.busy": "2026-05-19T20:31:35.053972Z", "iopub.status.idle": "2026-05-19T20:31:35.259939Z", "shell.execute_reply": "2026-05-19T20:31:35.259335Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> retrieving the properties of the bin enclosing coordinate (x,y) = (0.3,20.5) \n", " bin center:\n", " 1) 0x7f2531619120 RooRealVar:: x = 1 L(-10 - 10) B(10) \"x\"\n", " 2) 0x7f253133df80 RooRealVar:: y = 22 L(0 - 40) B(10) \"y\"\n", " weight = 76\n" ] } ], "source": [ "x = 0.3;\n", "y = 20.5;\n", "cout << \">> retrieving the properties of the bin enclosing coordinate (x,y) = (0.3,20.5) \" << endl;\n", "cout << \" bin center:\" << endl;\n", "dh.get(RooArgSet(x, y))->Print(\"v\"); // load bin center coordinates in internal buffer\n", "cout << \" weight = \" << dh.weight() << endl; // return weight of last loaded coordinates" ] }, { "cell_type": "markdown", "id": "e39c9721", "metadata": {}, "source": [ "Reduce the 2-dimensional binned dataset to a 1-dimensional binned dataset\n", "\n", "All reduce() methods are interfaced in RooAbsData. All reduction techniques\n", "demonstrated on unbinned datasets can be applied to binned datasets as well." 
] }, { "cell_type": "code", "execution_count": 14, "id": "b21a7fa0", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:35.261638Z", "iopub.status.busy": "2026-05-19T20:31:35.261488Z", "iopub.status.idle": "2026-05-19T20:31:35.467204Z", "shell.execute_reply": "2026-05-19T20:31:35.466745Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">> Creating 1-dimensional projection on y of dh for bins with x>0\n", "DataStore dh (binned version of d)\n", " Contains 10 entries\n", " Observables: \n", " 1) y = 38 L(0 - 40) B(10) \"y\"\n", "Binned Dataset dh (binned version of d)\n", " Contains 10 bins with a total weight of 500\n", " Observables: 1) y = 38 L(0 - 40) B(10) \"y\"\n" ] } ], "source": [ "cout << \">> Creating 1-dimensional projection on y of dh for bins with x>0\" << endl;\n", "std::unique_ptr<RooAbsData> dh2{dh.reduce(y, \"x>0\")};\n", "dh2->Print(\"v\");" ] }, { "cell_type": "markdown", "id": "c2e67d02", "metadata": {}, "source": [ "Add dh2 to yframe and redraw" ] }, { "cell_type": "code", "execution_count": 15, "id": "83374638", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:35.468795Z", "iopub.status.busy": "2026-05-19T20:31:35.468681Z", "iopub.status.idle": "2026-05-19T20:31:35.677023Z", "shell.execute_reply": "2026-05-19T20:31:35.676409Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[#1] INFO:Plotting -- RooPlot::updateFitRangeNorm: New event count of 500 will supersede previous event count of 1000 for normalization of PDF projections\n" ] } ], "source": [ "dh2->plotOn(yframe, LineColor(kRed), MarkerColor(kRed));" ] }, { "cell_type": "markdown", "id": "be1e87d9", "metadata": {}, "source": [ "Saving and loading from file\n", "-------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "471b84b5", "metadata": {}, "source": [ "Datasets can be persisted with ROOT I/O" ] }, { "cell_type": "code", 
"execution_count": 16, "id": "95463819", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:35.678837Z", "iopub.status.busy": "2026-05-19T20:31:35.678716Z", "iopub.status.idle": "2026-05-19T20:31:36.023385Z", "shell.execute_reply": "2026-05-19T20:31:36.022783Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", ">> Persisting d via ROOT I/O\n", "TFile**\t\trf402_datahandling.root\t\n", " TFile*\t\trf402_datahandling.root\t\n", " KEY: RooDataSet\td;1\td\n", " KEY: TProcessID\tProcessID0;1\tb8841ddf-53c1-11f1-beb4-0200590abeef\n" ] } ], "source": [ "cout << endl << \">> Persisting d via ROOT I/O\" << endl;\n", "TFile f(\"rf402_datahandling.root\", \"RECREATE\");\n", "d.Write();\n", "f.ls();" ] }, { "cell_type": "markdown", "id": "f556ff64", "metadata": {}, "source": [ "To read the dataset back in a future session (Get() reads the object from the file, whereas FindObject() only searches objects already in memory):\n", "> TFile f(\"rf402_datahandling.root\") ;\n", "> RooDataSet* d = (RooDataSet*) f.Get(\"d\") ;" ] }, { "cell_type": "code", "execution_count": 17, "id": "613c0048", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:36.025066Z", "iopub.status.busy": "2026-05-19T20:31:36.024950Z", "iopub.status.idle": "2026-05-19T20:31:36.233070Z", "shell.execute_reply": "2026-05-19T20:31:36.232408Z" } }, "outputs": [], "source": [ "new TCanvas(\"rf402_datahandling\", \"rf402_datahandling\", 600, 600);\n", "gPad->SetLeftMargin(0.15);\n", "yframe->GetYaxis()->SetTitleOffset(1.4);\n", "yframe->Draw();" ] }, { "cell_type": "markdown", "id": "a9473c94", "metadata": {}, "source": [ "Draw all canvases " ] }, { "cell_type": "code", "execution_count": 18, "id": "086187ac", "metadata": { "collapsed": false, "execution": { "iopub.execute_input": "2026-05-19T20:31:36.235182Z", "iopub.status.busy": "2026-05-19T20:31:36.235058Z", "iopub.status.idle": "2026-05-19T20:31:36.473013Z", "shell.execute_reply": "2026-05-19T20:31:36.472424Z" } }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%jsroot on\n", "gROOT->GetListOfCanvases()->Draw()" ] } ], "metadata": { "kernelspec": { "display_name": "ROOT C++", "language": "c++", "name": "root" }, "language_info": { "codemirror_mode": "text/x-c++src", "file_extension": ".C", "mimetype": " text/x-c++src", "name": "c++" } }, "nbformat": 4, "nbformat_minor": 5 }