Logo ROOT  
Reference Guide
 
Loading...
Searching...
No Matches
tmva100_DataPreparation.py File Reference

Detailed Description

View in nbviewer Open in SWAN
This tutorial illustrates how to prepare ROOT datasets to be nicely readable by most machine learning methods.

This requires filtering the initial complex datasets and writing the data in a flat format.

import ROOT
def filter_events(df):
"""
Reduce initial dataset to only events which shall be used for training
"""
return df.Filter("nElectron>=2 && nMuon>=2", "At least two electrons and two muons")
def define_variables(df):
"""
Define the variables which shall be used for training
"""
return df.Define("Muon_pt_1", "Muon_pt[0]")\
.Define("Muon_pt_2", "Muon_pt[1]")\
.Define("Electron_pt_1", "Electron_pt[0]")\
.Define("Electron_pt_2", "Electron_pt[1]")
variables = ["Muon_pt_1", "Muon_pt_2", "Electron_pt_1", "Electron_pt_2"]
if __name__ == "__main__":
for filename, label in [["SMHiggsToZZTo4L.root", "signal"], ["ZZTo2e2mu.root", "background"]]:
print(">>> Extract the training and testing events for {} from the {} dataset.".format(
label, filename))
# Load dataset, filter the required events and define the training variables
filepath = "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/" + filename
df = ROOT.RDataFrame("Events", filepath)
df = filter_events(df)
df = define_variables(df)
# Book cutflow report
report = df.Report()
# Split dataset by event number for training and testing
columns = ROOT.std.vector["string"](variables)
df.Filter("event % 2 == 0", "Select events with even event number for training")\
.Snapshot("Events", "train_" + label + ".root", columns)
df.Filter("event % 2 == 1", "Select events with odd event number for training")\
.Snapshot("Events", "test_" + label + ".root", columns)
# Print cutflow report
report.Print()
Option_t Option_t TPoint TPoint const char GetTextMagnitude GetFillStyle GetLineColor GetLineWidth GetMarkerStyle GetTextAlign GetTextColor GetTextSize void char Point_t Rectangle_t WindowAttributes_t Float_t Float_t Float_t Int_t Int_t UInt_t UInt_t Rectangle_t Int_t Int_t Window_t TString Int_t GCValues_t GetPrimarySelectionOwner GetDisplay GetScreen GetColormap GetNativeEvent const char const char dpyName wid window const char font_name cursor keysym reg const char only_if_exist regb h Point_t winding char text const char depth char const char Int_t count const char ColorStruct_t color const char Pixmap_t Pixmap_t PictureAttributes_t attr const char char ret_data h unsigned char height h Atom_t Int_t ULong_t ULong_t unsigned char prop_list Atom_t Atom_t Atom_t Time_t format
ROOT's RDataFrame offers a modern, high-level interface for analysis of data stored in TTree ,...
At least two electrons and two muons: pass=45352 all=299973 -- eff=15.12 % cumulative eff=15.12 %
At least two electrons and two muons: pass=262776 all=1497445 -- eff=17.55 % cumulative eff=17.55 %
>>> Extract the training and testing events for signal from the SMHiggsToZZTo4L.root dataset.
>>> Extract the training and testing events for background from the ZZTo2e2mu.root dataset.
Date
August 2019
Author
Stefan Wunsch

Definition in file tmva100_DataPreparation.py.