{
"cells": [
{
"cell_type": "markdown",
"id": "da29a309",
"metadata": {},
"source": [
"# rf402_datahandling\n",
"Data and categories: tools for manipulation of (un)binned datasets\n",
"\n",
"**Author:** Wouter Verkerke \n",
"This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Tuesday, May 19, 2026 at 08:31 PM."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6a290aef",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:32.588775Z",
"iopub.status.busy": "2026-05-19T20:31:32.588667Z",
"iopub.status.idle": "2026-05-19T20:31:32.602822Z",
"shell.execute_reply": "2026-05-19T20:31:32.602330Z"
}
},
"outputs": [],
"source": [
"%%cpp -d\n",
"#include \"RooRealVar.h\"\n",
"#include \"RooDataSet.h\"\n",
"#include \"RooDataHist.h\"\n",
"#include \"RooGaussian.h\"\n",
"#include \"RooCategory.h\"\n",
"#include \"TCanvas.h\"\n",
"#include \"TAxis.h\"\n",
"#include \"RooPlot.h\"\n",
"#include \"TFile.h\"\n",
"using namespace RooFit;"
]
},
{
"cell_type": "markdown",
"id": "ba608f2e",
"metadata": {},
"source": [
"Binned datasets (RooDataHist) and unbinned datasets (RooDataSet) share\n",
"many properties and inherit from a common abstract base class\n",
"(RooAbsData), which provides an interface for all operations\n",
"that can be performed regardless of the data format."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "11078b4c",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:32.604448Z",
"iopub.status.busy": "2026-05-19T20:31:32.604336Z",
"iopub.status.idle": "2026-05-19T20:31:32.954306Z",
"shell.execute_reply": "2026-05-19T20:31:32.953614Z"
}
},
"outputs": [],
"source": [
"RooRealVar x(\"x\", \"x\", -10, 10);\n",
"RooRealVar y(\"y\", \"y\", 0, 40);\n",
"RooCategory c(\"c\", \"c\");\n",
"c.defineType(\"Plus\", +1);\n",
"c.defineType(\"Minus\", -1);"
]
},
{
"cell_type": "markdown",
"id": "80ccb6df",
"metadata": {},
"source": [
"Basic Operations on unbinned datasets\n",
"--------------------------------------------------------------"
]
},
{
"cell_type": "markdown",
"id": "fdb8fd6b",
"metadata": {},
"source": [
"RooDataSet is an unbinned dataset (a collection of points in N-dimensional space)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e19993e3",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:32.955986Z",
"iopub.status.busy": "2026-05-19T20:31:32.955873Z",
"iopub.status.idle": "2026-05-19T20:31:33.163705Z",
"shell.execute_reply": "2026-05-19T20:31:33.163274Z"
}
},
"outputs": [],
"source": [
"RooDataSet d(\"d\", \"d\", RooArgSet(x, y, c));"
]
},
{
"cell_type": "markdown",
"id": "f02e0ab4",
"metadata": {},
"source": [
"Unlike RooAbsArg objects (RooAbsPdf, RooFormulaVar, ...), datasets are not attached to\n",
"the variables they are constructed from. Instead, they are attached to an internal\n",
"clone of the supplied set of arguments."
]
},
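{
"cell_type": "markdown",
"id": "4b1e0a01",
"metadata": {},
"source": [
"A minimal sketch of this decoupling, assuming a fresh ROOT session. The names `cloneDemo`, `xv` and `data` are illustrative and not part of the tutorial. Changing the variable after `add()` should leave the stored row untouched:\n",
"\n",
"```cpp\n",
"#include \"RooRealVar.h\"\n",
"#include \"RooDataSet.h\"\n",
"#include \"RooArgSet.h\"\n",
"#include <iostream>\n",
"\n",
"// Hypothetical helper, for illustration only\n",
"void cloneDemo()\n",
"{\n",
"   RooRealVar xv(\"xv\", \"xv\", -10, 10);\n",
"   RooDataSet data(\"data\", \"data\", RooArgSet(xv));\n",
"\n",
"   xv = 3.0;\n",
"   data.add(RooArgSet(xv)); // the current value of xv is copied into the dataset\n",
"\n",
"   xv = -7.0; // modifies only the external variable\n",
"\n",
"   // Read the stored value back through the dataset's internal clone\n",
"   std::cout << data.get(0)->getRealValue(\"xv\") << std::endl;\n",
"}\n",
"```\n",
"\n",
"The value read back is the one that was current when `add()` was called, which is why values have to be passed explicitly as a RooArgSet when filling a dataset."
]
},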
{
"cell_type": "markdown",
"id": "6d43d4d3",
"metadata": {},
"source": [
"Fill d with dummy values"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f150f12a",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:33.176279Z",
"iopub.status.busy": "2026-05-19T20:31:33.176127Z",
"iopub.status.idle": "2026-05-19T20:31:33.386113Z",
"shell.execute_reply": "2026-05-19T20:31:33.385630Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataStore d (d)\n",
" Contains 1000 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) y = 31.607 L(0 - 40) \"y\"\n",
" 3) c = Plus(idx = 1)\n",
" \"c\"\n",
"\n"
]
}
],
"source": [
"Int_t i;\n",
"for (i = 0; i < 1000; i++) {\n",
" // Integer division is intentional: x advances in unit steps from -10 to 9\n",
" x = i / 50 - 10;\n",
" y = sqrt(1.0 * i);\n",
" c.setLabel((i % 2) ? \"Plus\" : \"Minus\");\n",
"\n",
" // We must explicitly refer to x,y,c here to pass the values because\n",
" // d is not linked to them (as explained above)\n",
" d.add(RooArgSet(x, y, c));\n",
"}\n",
"d.Print(\"v\");\n",
"cout << endl;"
]
},
{
"cell_type": "markdown",
"id": "01c79eb5",
"metadata": {},
"source": [
"The get() function returns a pointer to the internal copy of the RooArgSet(x,y,c)\n",
"supplied in the constructor"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ff1f857b",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:33.387635Z",
"iopub.status.busy": "2026-05-19T20:31:33.387508Z",
"iopub.status.idle": "2026-05-19T20:31:33.595751Z",
"shell.execute_reply": "2026-05-19T20:31:33.595245Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1) 0x7f2531092ce0 RooRealVar:: x = 9 L(-10 - 10) \"x\"\n",
" 2) 0x7f25311738c0 RooRealVar:: y = 31.607 L(0 - 40) \"y\"\n",
" 3) 0x7f25310d8870 RooCategory:: c = Plus(idx = 1)\n",
" \"c\"\n",
"\n"
]
}
],
"source": [
"const RooArgSet *row = d.get();\n",
"row->Print(\"v\");\n",
"cout << endl;"
]
},
{
"cell_type": "markdown",
"id": "03a3e2c4",
"metadata": {},
"source": [
"Calling get() with an argument loads the specified data point into the internal row and returns\n",
"a pointer to that row's RooArgSet. get() always returns the same pointer, unless\n",
"an invalid row number is specified, in which case a null pointer is returned."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cef015a2",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:33.597262Z",
"iopub.status.busy": "2026-05-19T20:31:33.597152Z",
"iopub.status.idle": "2026-05-19T20:31:33.805645Z",
"shell.execute_reply": "2026-05-19T20:31:33.804861Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1) 0x7f2531092ce0 RooRealVar:: x = 8 L(-10 - 10) \"x\"\n",
" 2) 0x7f25311738c0 RooRealVar:: y = 30 L(0 - 40) \"y\"\n",
" 3) 0x7f25310d8870 RooCategory:: c = Minus(idx = -1)\n",
" \"c\"\n",
"\n"
]
}
],
"source": [
"d.get(900)->Print(\"v\");\n",
"cout << endl;"
]
},
{
"cell_type": "markdown",
"id": "b926964c",
"metadata": {},
"source": [
"Reducing, Appending and Merging\n",
"-------------------------------------------------------------"
]
},
{
"cell_type": "markdown",
"id": "1a6ca657",
"metadata": {},
"source": [
"The reduce() function returns a new dataset which is a subset of the original"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "59098c38",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:33.807050Z",
"iopub.status.busy": "2026-05-19T20:31:33.806940Z",
"iopub.status.idle": "2026-05-19T20:31:34.013395Z",
"shell.execute_reply": "2026-05-19T20:31:34.012986Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
">> d1 has only columns x,c\n",
"DataStore d (d)\n",
" Contains 1000 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) c = Plus(idx = 1)\n",
" \"c\"\n",
"\n",
">> d2 has only column y\n",
"DataStore d (d)\n",
" Contains 1000 entries\n",
" Observables: \n",
" 1) y = 31.607 L(0 - 40) \"y\"\n",
"\n",
">> d3 has only the points with y>5.17\n",
"DataStore d (d)\n",
" Contains 973 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) y = 31.607 L(0 - 40) \"y\"\n",
" 3) c = Plus(idx = 1)\n",
" \"c\"\n",
"\n",
">> d4 has only columns x,c for data points with y>5.17\n",
"DataStore d (d)\n",
" Contains 973 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) c = Plus(idx = 1)\n",
" \"c\"\n"
]
}
],
"source": [
"cout << endl << \">> d1 has only columns x,c\" << endl;\n",
"std::unique_ptr<RooAbsData> d1{d.reduce({x, c})};\n",
"d1->Print(\"v\");\n",
"\n",
"cout << endl << \">> d2 has only column y\" << endl;\n",
"std::unique_ptr<RooAbsData> d2{d.reduce({y})};\n",
"d2->Print(\"v\");\n",
"\n",
"cout << endl << \">> d3 has only the points with y>5.17\" << endl;\n",
"std::unique_ptr<RooAbsData> d3{d.reduce(\"y>5.17\")};\n",
"d3->Print(\"v\");\n",
"\n",
"cout << endl << \">> d4 has only columns x,c for data points with y>5.17\" << endl;\n",
"std::unique_ptr<RooAbsData> d4{d.reduce({x, c}, \"y>5.17\")};\n",
"d4->Print(\"v\");"
]
},
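{
"cell_type": "markdown",
"id": "4b1e0a02",
"metadata": {},
"source": [
"The entry count reported for d3 and d4 can be cross-checked without RooFit. The sketch below is plain C++ and assumes only the fill rule used earlier (y = sqrt(i) for i = 0..999); the helper name `countPassing` is hypothetical:\n",
"\n",
"```cpp\n",
"#include <cmath>\n",
"\n",
"// Count how many of the n generated points (y = sqrt(i), i = 0..n-1) satisfy y > cut\n",
"int countPassing(int n, double cut)\n",
"{\n",
"   int passed = 0;\n",
"   for (int i = 0; i < n; ++i) {\n",
"      if (std::sqrt(1.0 * i) > cut)\n",
"         ++passed;\n",
"   }\n",
"   return passed;\n",
"}\n",
"```\n",
"\n",
"`countPassing(1000, 5.17)` gives 973, since the first point that passes the cut is i = 27. This matches the number of entries printed for d3 and d4."
]
},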
{
"cell_type": "markdown",
"id": "c13be878",
"metadata": {},
"source": [
"The merge() function adds two datasets column-wise"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1141cadb",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:34.014961Z",
"iopub.status.busy": "2026-05-19T20:31:34.014849Z",
"iopub.status.idle": "2026-05-19T20:31:34.220409Z",
"shell.execute_reply": "2026-05-19T20:31:34.219950Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
">> merge d2(y) with d1(x,c) to form d1(x,c,y)\n",
"DataStore d (d)\n",
" Contains 1000 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) c = Plus(idx = 1)\n",
" \"c\"\n",
" 3) y = 31.607 L(0 - 40) \"y\"\n"
]
}
],
"source": [
"cout << endl << \">> merge d2(y) with d1(x,c) to form d1(x,c,y)\" << endl;\n",
"static_cast<RooDataSet&>(*d1).merge(&static_cast<RooDataSet&>(*d2));\n",
"d1->Print(\"v\");"
]
},
{
"cell_type": "markdown",
"id": "15b07c56",
"metadata": {},
"source": [
"The append() function adds two datasets row-wise"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "dd0d7a78",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:34.222003Z",
"iopub.status.busy": "2026-05-19T20:31:34.221892Z",
"iopub.status.idle": "2026-05-19T20:31:34.427642Z",
"shell.execute_reply": "2026-05-19T20:31:34.427087Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
">> append data points of d3 to d1\n",
"DataStore d (d)\n",
" Contains 1973 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) \"x\"\n",
" 2) c = Plus(idx = 1)\n",
" \"c\"\n",
" 3) y = 31.607 L(0 - 40) \"y\"\n"
]
}
],
"source": [
"cout << endl << \">> append data points of d3 to d1\" << endl;\n",
"static_cast<RooDataSet&>(*d1).append(static_cast<RooDataSet&>(*d3));\n",
"d1->Print(\"v\");"
]
},
{
"cell_type": "markdown",
"id": "71343637",
"metadata": {},
"source": [
"Operations on binned datasets\n",
"---------------------------------------------------------"
]
},
{
"cell_type": "markdown",
"id": "7560a700",
"metadata": {},
"source": [
"A binned dataset can be constructed empty, from an unbinned dataset, or\n",
"from a ROOT native histogram (TH1,2,3)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "34a58772",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:34.429174Z",
"iopub.status.busy": "2026-05-19T20:31:34.429058Z",
"iopub.status.idle": "2026-05-19T20:31:34.637557Z",
"shell.execute_reply": "2026-05-19T20:31:34.637027Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">> construct dh (binned) from d(unbinned) but only take the x and y dimensions,\n",
">> the category 'c' will be projected in the filling process\n"
]
}
],
"source": [
"cout << \">> construct dh (binned) from d(unbinned) but only take the x and y dimensions,\" << endl\n",
" << \">> the category 'c' will be projected in the filling process\" << endl;"
]
},
{
"cell_type": "markdown",
"id": "87218d30",
"metadata": {},
"source": [
"The binning of real variables (like x,y) is done using their fit range\n",
"'get/setRange()' and number of specified fit bins 'get/setBins()'.\n",
"Category dimensions of binned datasets get one bin per defined category state"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0a135acc",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:34.639053Z",
"iopub.status.busy": "2026-05-19T20:31:34.638943Z",
"iopub.status.idle": "2026-05-19T20:31:34.844929Z",
"shell.execute_reply": "2026-05-19T20:31:34.844386Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataStore dh (binned version of d)\n",
" Contains 100 entries\n",
" Observables: \n",
" 1) x = 9 L(-10 - 10) B(10) \"x\"\n",
" 2) y = 38 L(0 - 40) B(10) \"y\"\n",
"Binned Dataset dh (binned version of d)\n",
" Contains 100 bins with a total weight of 1000\n",
" Observables: 1) x = 9 L(-10 - 10) B(10) \"x\"\n",
" 2) y = 38 L(0 - 40) B(10) \"y\"\n"
]
}
],
"source": [
"x.setBins(10);\n",
"y.setBins(10);\n",
"RooDataHist dh(\"dh\", \"binned version of d\", RooArgSet(x, y), d);\n",
"dh.Print(\"v\");\n",
"\n",
"RooPlot *yframe = y.frame(Bins(10), Title(\"Operations on binned datasets\"));\n",
"dh.plotOn(yframe); // plot projection of 2D binned data on y"
]
},
{
"cell_type": "markdown",
"id": "af286b6b",
"metadata": {},
"source": [
"Examine the statistics of a binned dataset"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ccde58fd",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:34.846673Z",
"iopub.status.busy": "2026-05-19T20:31:34.846544Z",
"iopub.status.idle": "2026-05-19T20:31:35.052636Z",
"shell.execute_reply": "2026-05-19T20:31:35.051674Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">> number of bins in dh : 100\n",
">> sum of weights in dh : 1000\n",
">> integral over histogram: 8000\n"
]
}
],
"source": [
"cout << \">> number of bins in dh : \" << dh.numEntries() << endl;\n",
"cout << \">> sum of weights in dh : \" << dh.sum(false) << endl;\n",
"cout << \">> integral over histogram: \" << dh.sum(true) << endl; // accounts for bin volume"
]
},
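{
"cell_type": "markdown",
"id": "4b1e0a03",
"metadata": {},
"source": [
"The factor between the two sums is consistent with the bin volume. As a plain C++ sketch, assuming uniform binning in both dimensions (the helper name `binnedIntegral` is hypothetical and not a RooFit function):\n",
"\n",
"```cpp\n",
"// Integral of a uniformly binned 2D histogram: total weight times the constant bin volume\n",
"double binnedIntegral(double sumOfWeights, double xLo, double xHi, int nx,\n",
"                      double yLo, double yHi, int ny)\n",
"{\n",
"   const double binVolume = ((xHi - xLo) / nx) * ((yHi - yLo) / ny);\n",
"   return sumOfWeights * binVolume;\n",
"}\n",
"```\n",
"\n",
"With 10 bins on x in [-10, 10] and 10 bins on y in [0, 40], each bin has volume 2 x 4 = 8, so `binnedIntegral(1000, -10, 10, 10, 0, 40, 10)` gives 8000, in agreement with the value printed for `dh.sum(true)`."
]
},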
{
"cell_type": "markdown",
"id": "ad9a1e5e",
"metadata": {},
"source": [
"Locate a bin from a set of coordinates and retrieve its properties"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "c18e7851",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:35.054136Z",
"iopub.status.busy": "2026-05-19T20:31:35.053972Z",
"iopub.status.idle": "2026-05-19T20:31:35.259939Z",
"shell.execute_reply": "2026-05-19T20:31:35.259335Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">> retrieving the properties of the bin enclosing coordinate (x,y) = (0.3,20.5) \n",
" bin center:\n",
" 1) 0x7f2531619120 RooRealVar:: x = 1 L(-10 - 10) B(10) \"x\"\n",
" 2) 0x7f253133df80 RooRealVar:: y = 22 L(0 - 40) B(10) \"y\"\n",
" weight = 76\n"
]
}
],
"source": [
"x = 0.3;\n",
"y = 20.5;\n",
"cout << \">> retrieving the properties of the bin enclosing coordinate (x,y) = (0.3,20.5) \" << endl;\n",
"cout << \" bin center:\" << endl;\n",
"dh.get(RooArgSet(x, y))->Print(\"v\"); // load bin center coordinates in internal buffer\n",
"cout << \" weight = \" << dh.weight() << endl; // return weight of last loaded coordinates"
]
},
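{
"cell_type": "markdown",
"id": "4b1e0a04",
"metadata": {},
"source": [
"The bin centers printed above can be reproduced by hand. The sketch below is plain C++, assumes uniform binning and a value inside [lo, hi), and uses a hypothetical helper name `binCenter`:\n",
"\n",
"```cpp\n",
"#include <cmath>\n",
"\n",
"// Center of the uniform bin that contains value v, for n bins on [lo, hi)\n",
"double binCenter(double v, double lo, double hi, int n)\n",
"{\n",
"   const double width = (hi - lo) / n;\n",
"   const int index = static_cast<int>(std::floor((v - lo) / width));\n",
"   return lo + (index + 0.5) * width;\n",
"}\n",
"```\n",
"\n",
"`binCenter(0.3, -10, 10, 10)` gives 1 and `binCenter(20.5, 0, 40, 10)` gives 22, the same coordinates that `dh.get(RooArgSet(x, y))` loaded."
]
},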
{
"cell_type": "markdown",
"id": "e39c9721",
"metadata": {},
"source": [
"Reduce the 2-dimensional binned dataset to a 1-dimensional binned dataset\n",
"\n",
"All reduce() methods are interfaced in RooAbsData. All reduction techniques\n",
"demonstrated on unbinned datasets can be applied to binned datasets as well."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b21a7fa0",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:35.261638Z",
"iopub.status.busy": "2026-05-19T20:31:35.261488Z",
"iopub.status.idle": "2026-05-19T20:31:35.467204Z",
"shell.execute_reply": "2026-05-19T20:31:35.466745Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">> Creating 1-dimensional projection on y of dh for bins with x>0\n",
"DataStore dh (binned version of d)\n",
" Contains 10 entries\n",
" Observables: \n",
" 1) y = 38 L(0 - 40) B(10) \"y\"\n",
"Binned Dataset dh (binned version of d)\n",
" Contains 10 bins with a total weight of 500\n",
" Observables: 1) y = 38 L(0 - 40) B(10) \"y\"\n"
]
}
],
"source": [
"cout << \">> Creating 1-dimensional projection on y of dh for bins with x>0\" << endl;\n",
"std::unique_ptr<RooAbsData> dh2{dh.reduce(y, \"x>0\")};\n",
"dh2->Print(\"v\");"
]
},
{
"cell_type": "markdown",
"id": "c2e67d02",
"metadata": {},
"source": [
"Add dh2 to yframe and redraw"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "83374638",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:35.468795Z",
"iopub.status.busy": "2026-05-19T20:31:35.468681Z",
"iopub.status.idle": "2026-05-19T20:31:35.677023Z",
"shell.execute_reply": "2026-05-19T20:31:35.676409Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[#1] INFO:Plotting -- RooPlot::updateFitRangeNorm: New event count of 500 will supersede previous event count of 1000 for normalization of PDF projections\n"
]
}
],
"source": [
"dh2->plotOn(yframe, LineColor(kRed), MarkerColor(kRed));"
]
},
{
"cell_type": "markdown",
"id": "be1e87d9",
"metadata": {},
"source": [
"Saving and loading from file\n",
"-------------------------------------------------------"
]
},
{
"cell_type": "markdown",
"id": "471b84b5",
"metadata": {},
"source": [
"Datasets can be persisted with ROOT I/O"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "95463819",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:35.678837Z",
"iopub.status.busy": "2026-05-19T20:31:35.678716Z",
"iopub.status.idle": "2026-05-19T20:31:36.023385Z",
"shell.execute_reply": "2026-05-19T20:31:36.022783Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
">> Persisting d via ROOT I/O\n",
"TFile**\t\trf402_datahandling.root\t\n",
" TFile*\t\trf402_datahandling.root\t\n",
" KEY: RooDataSet\td;1\td\n",
" KEY: TProcessID\tProcessID0;1\tb8841ddf-53c1-11f1-beb4-0200590abeef\n"
]
}
],
"source": [
"cout << endl << \">> Persisting d via ROOT I/O\" << endl;\n",
"TFile f(\"rf402_datahandling.root\", \"RECREATE\");\n",
"d.Write();\n",
"f.ls();"
]
},
{
"cell_type": "markdown",
"id": "f556ff64",
"metadata": {},
"source": [
"To read the dataset back in a future session:\n",
"> TFile f(\"rf402_datahandling.root\") ;\n",
"> RooDataSet* d = (RooDataSet*) f.Get(\"d\") ;"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "613c0048",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:36.025066Z",
"iopub.status.busy": "2026-05-19T20:31:36.024950Z",
"iopub.status.idle": "2026-05-19T20:31:36.233070Z",
"shell.execute_reply": "2026-05-19T20:31:36.232408Z"
}
},
"outputs": [],
"source": [
"new TCanvas(\"rf402_datahandling\", \"rf402_datahandling\", 600, 600);\n",
"gPad->SetLeftMargin(0.15);\n",
"yframe->GetYaxis()->SetTitleOffset(1.4);\n",
"yframe->Draw();"
]
},
{
"cell_type": "markdown",
"id": "a9473c94",
"metadata": {},
"source": [
"Draw all canvases "
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "086187ac",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2026-05-19T20:31:36.235182Z",
"iopub.status.busy": "2026-05-19T20:31:36.235058Z",
"iopub.status.idle": "2026-05-19T20:31:36.473013Z",
"shell.execute_reply": "2026-05-19T20:31:36.472424Z"
}
},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%jsroot on\n",
"gROOT->GetListOfCanvases()->Draw()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ROOT C++",
"language": "c++",
"name": "root"
},
"language_info": {
"codemirror_mode": "text/x-c++src",
"file_extension": ".C",
"mimetype": "text/x-c++src",
"name": "c++"
}
},
"nbformat": 4,
"nbformat_minor": 5
}