Implementation of a Decision Tree

In a decision tree, successive decision nodes are used to categorize the events of the sample as either signal or background. Each node uses only a single discriminating variable to decide whether the event is signal-like ("goes right") or background-like ("goes left"). This forms a tree-like structure with "baskets" at the end (leaf nodes), and an event is classified as either signal or background according to whether the basket where it ends up has been classified as signal or background during the training.

Training a decision tree is the process of defining the "cut criteria" for each node. The training starts with the root node. Here one takes the full training event sample and selects the variable and corresponding cut value that give the best separation between signal and background at this stage. Using this cut criterion, the sample is then divided into two subsamples, a signal-like (right) and a background-like (left) sample. Two new nodes are then created, one for each of the two subsamples, and they are constructed using the same mechanism as described for the root node. The division is stopped once a node has reached either a minimum number of events or a minimum or maximum signal purity. These leaf nodes are then called "signal" or "background" depending on whether they contain more signal or background events from the training sample.
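As a rough illustration of this recursive training procedure, the following minimal sketch uses simplified stand-in types (SimpleEvent, SimpleNode) and a hypothetical helper FindBestSplit; none of these are the actual TMVA classes or methods, and the stopping criteria are reduced to the node size and purity limits described above. A possible event-loop implementation of FindBestSplit is sketched further below, after the description of the node-splitting cut scan.

#include <cstddef>
#include <vector>

// Simplified stand-ins for TMVA::Event and TMVA::DecisionTreeNode (illustrative only).
struct SimpleEvent {
   std::vector<double> vars;       // discriminating variables of the event
   bool                isSignal;   // true for signal, false for background
};

struct SimpleNode {
   int         cutVariable  = -1;      // index of the variable used at this node
   double      cutValue     = 0;       // events with vars[cutVariable] > cutValue go right
   bool        isLeaf       = false;
   bool        leafIsSignal = false;   // classification assigned to a leaf node
   SimpleNode* left  = nullptr;        // background-like daughter
   SimpleNode* right = nullptr;        // signal-like daughter
};

// Hypothetical helper: find the variable and cut value giving the best
// signal/background separation for this sample (see the cut-scan sketch below).
void FindBestSplit(const std::vector<SimpleEvent>& sample, SimpleNode* node);

// Recursively split a node until it holds too few events or is (almost) pure.
SimpleNode* BuildNode(const std::vector<SimpleEvent>& sample,
                      std::size_t minSize, double minPurity, double maxPurity)
{
   SimpleNode* node = new SimpleNode;

   std::size_t nSig = 0;
   for (const SimpleEvent& ev : sample) if (ev.isSignal) ++nSig;
   const double purity = sample.empty() ? 0.0 : double(nSig) / double(sample.size());

   // stopping criteria: minimum node size or minimum/maximum signal purity reached
   if (sample.size() < minSize || purity < minPurity || purity > maxPurity) {
      node->isLeaf       = true;
      node->leafIsSignal = (purity > 0.5);   // majority of the training events decides
      return node;
   }

   // select the best variable/cut combination for this node
   FindBestSplit(sample, node);

   // divide the sample according to the chosen cut
   std::vector<SimpleEvent> leftSample, rightSample;
   for (const SimpleEvent& ev : sample) {
      if (ev.vars[node->cutVariable] > node->cutValue) rightSample.push_back(ev);
      else                                             leftSample.push_back(ev);
   }

   // guard against degenerate splits (no valid cut found or one empty daughter)
   if (node->cutVariable < 0 || leftSample.empty() || rightSample.empty()) {
      node->isLeaf       = true;
      node->leafIsSignal = (purity > 0.5);
      return node;
   }

   node->right = BuildNode(rightSample, minSize, minPurity, maxPurity);
   node->left  = BuildNode(leftSample,  minSize, minPurity, maxPurity);
   return node;
}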
void | TMVA::BinaryTree::DeleteNode(TMVA::Node*) |
void | FindMinAndMax(vector<TMVA::Event*>& eventSample, vector<Double_t>& min, vector<Double_t>& max) |
Double_t | GetNodeError(TMVA::DecisionTreeNode* node) |
Double_t | GetSubTreeError(TMVA::DecisionTreeNode* node) |
Double_t | SamplePurity(vector<TMVA::Event*> eventSample) |
void | SetCutPoints(vector<Double_t>& cut_points, Double_t xmin, Double_t xmax, Int_t num_gridpoints) |
UInt_t | TMVA::BinaryTree::fDepth | maximal depth reached in the tree |
TMVA::MsgLogger | TMVA::BinaryTree::fLogger | message logger |
Int_t | TMVA::BinaryTree::fNNodes | total number of nodes in the tree (counted) |
multimap<Double_t,TMVA::DecisionTreeNode*> | fLinkStrengthMap | prune strengths at which the subtree below the node would be pruned |
Double_t | fMinSepGain | minimum separation gain required to perform node splitting |
Double_t | fMinSize | minimum number of events in a node |
TRandom2* | fMyTrandom | random number generator for randomised trees |
Int_t | fNCuts | number of grid points in the variable cut scans |
Int_t | fNvars | number of variables used to separate S and B |
TMVA::DecisionTree::EPruneMethod | fPruneMethod | method used for pruning |
Double_t | fPruneStrength | a parameter setting the "amount" of pruning; needs to be adjusted |
multimap<Double_t,TMVA::DecisionTreeNode*> | fQualityGainMap | the quality-gain of pre-leaf nodes |
TMVA::SeparationBase* | fQualityIndex | separation/quality criterion for CC-pruning |
multimap<Double_t,TMVA::DecisionTreeNode*> | fQualityMap | the quality of leaf nodes |
Bool_t | fRandomisedTree | choose at each node splitting a random set of variables |
TMVA::SeparationBase* | fSepType | the separation criterion |
Int_t | fUseNvars | the number of variables used in randomised trees |
Bool_t | fUseSearchTree | cut scan done with a binary search tree or a simple event loop |
vector<Double_t> | fVariableImportance | the relative importance of the different variables |
static const Int_t | fgDebugLevel | debug level determining some printout/control plots etc. |
default constructor using the GiniIndex as separation criterion; no restrictions on the minimum number of events in a leaf node or on the separation gain in the node splitting
default constructor using the GiniIndex as separation criterion; no restrictions on the minimum number of events in a leaf node or on the separation gain in the node splitting
constructor specifying the separation type, the minimum number of events in a node that is still subjected to further splitting, and the number of bins in the grid used when applying the cut for the node splitting.
copy constructor that creates a true copy, i.e. a completely independent tree; the node copy will recursively copy all the nodes
descend the tree to find all its leaf nodes and, at the same time, fill in the maximal depth reached in the tree
build the decision tree by recursively splitting one (root) node into two daughter nodes (returns the number of nodes)
fill the existing decision tree structure by feeding events in from the top node and seeing where they end up
fill the existing decision tree structure by feeding events in from the top node and seeing where they end up
remove those last splits that result in two leaf nodes that are both of the type background
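A hedged sketch of such a clean-up step, reusing the illustrative SimpleNode type from the training sketch above (this is not the TMVA implementation, only one possible reading of the description):

void RemoveBackgroundSplits(SimpleNode* node)
{
   if (node == nullptr || node->isLeaf) return;

   // first clean the daughters, which may turn them into leaves themselves
   RemoveBackgroundSplits(node->left);
   RemoveBackgroundSplits(node->right);

   // if both daughters ended up as background-type leaves, the split is useless:
   // remove it and turn this node into a background leaf
   if (node->left && node->right &&
       node->left->isLeaf && node->right->isLeaf &&
       !node->left->leafIsSignal && !node->right->leafIsSignal) {
      delete node->left;   node->left  = nullptr;
      delete node->right;  node->right = nullptr;
      node->isLeaf       = true;
      node->leafIsSignal = false;
   }
}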
prune the decision tree (i.e. get rid of internal nodes) to avoid overtraining. Several different pruning methods can be applied, as selected by the variable "fPruneMethod". Currently, however, only the Expected Error Pruning is implemented.
recursive pruning of nodes using the Expected Error Pruning (EEP); if the node is an internal node, then prune
pruning of nodes using the Cost Complexity criterion. The pruning is performed until a minimum of the cost complexity CC(alpha) is reached, with CC(alpha) = alpha * N_leaves + sum over leaves [ N * Quality(leaf) ], where Quality(leaf) is given by 1 - purity (for the Misclassification Error), purity * (1 - purity) for the Gini Index, etc. Typically the Misclassification Error is used for guiding the pruning.
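A minimal sketch evaluating CC(alpha) from per-leaf event counts and qualities; the LeafInfo struct and the function name are illustrative, not part of TMVA:

#include <cstddef>
#include <vector>

// Hypothetical per-leaf summary (not a TMVA class): number of training events
// in the leaf and its quality, e.g. 1 - purity for the Misclassification Error.
struct LeafInfo {
   double nEvents;
   double quality;
};

// CC(alpha) = alpha * N_leaves + sum over leaves of [ N * Quality(leaf) ]
double CostComplexity(const std::vector<LeafInfo>& leaves, double alpha)
{
   double cc = alpha * static_cast<double>(leaves.size());
   for (const LeafInfo& leaf : leaves) cc += leaf.nEvents * leaf.quality;
   return cc;
}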
Similar to the Cost Complexity pruning, only here the "prune strength" (= alpha, the regularisation parameter in the Cost Complexity) for which the respective subtree below a node would be pruned is calculated immediately. Pruning then continues until all nodes have such a value larger than the specified prune strength.
loop over all non-leaf nodes of the tree and calculate for each of these nodes the prune strength ("alpha") at which this node would win against its subtree (hence its subtree would be pruned). This is given by:

alpha < ( R(t) - R(T_t) ) / ( |T_t| - 1 )

where R(t) is the misclassification cost of the node t, R(T_t) is the misclassification cost of the subtree below t, and |T_t| is the number of terminal nodes in the subtree T_t.
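A minimal sketch of this per-node quantity, under the assumption that the misclassification costs and the terminal-node count are obtained from helpers like those described below (all names here are illustrative):

// The prune strength at which a node t would "win" against its subtree: for alpha
// above this value the subtree below t would be pruned away.
double CriticalPruneStrength(double costNode,        // R(t)
                             double costSubTree,     // R(T_t)
                             int    nTerminalNodes)  // |T_t|, >= 2 for a non-leaf node
{
   // only meaningful for internal nodes, where the subtree has at least two leaves
   return (costNode - costSubTree) / (nTerminalNodes - 1);
}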
get the misclassification cost of the subtree
get the misclassification cost of the subtree
return the number of terminal nodes in the sub-tree below Node n
returns the cost complexity criterion for the decision tree; see L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, "Classification and Regression Trees", Wadsworth International Group (1984), Chapman & Hall/CRC (1984)
returns the cost complexity criterion for the decision tree; see L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, "Classification and Regression Trees", Wadsworth International Group (1984), Chapman & Hall/CRC (1984)
traverse the whole tree and fill the map of QualityGain - Node for pre-leaf nodes, deciding which node is the next prune candidate
traverse the whole tree and find the leaf nodes, and then fill the Quality Criterion for the leaf nodes (Used in the Pruning)
get the pruning candidate node as the one pre-leaf node that gives the smallest quality gain
calculate an UPPER limit on the error made by the classification done by this node. If the purity S/(S+B) of the node is f, then, according to the training sample, the error rate (fraction of events misclassified by this node) is (1-f). Now f has a statistical uncertainty according to the binomial distribution, hence the error on f can be estimated (the same error as the binomial error for efficiency calculations: sigma = sqrt( eff*(1-eff)/nEvts )).
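One plausible reading of this upper limit, as a hedged sketch only (the function name and the simple one-sigma addition are assumptions, not necessarily the exact TMVA formula):

#include <cmath>

// If f = S/(S+B) is the node purity determined from nEvts training events, the
// error rate is (1 - f) and its binomial uncertainty is sigma = sqrt( f*(1-f)/nEvts );
// adding the two gives an upper estimate of the error of the node.
double NodeErrorUpperLimit(double purity, double nEvts)
{
   const double error = 1.0 - purity;
   const double sigma = std::sqrt(purity * (1.0 - purity) / nEvts);
   return error + sigma;
}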
calculate the expected statistical error on the subtree below "node", which is used in the expected error pruning
get the left daughter node of the current node "n"
get the right daughter node of the current node "n"
retrieve a node from the tree. Its position (up to a maximal tree depth of 64) is coded as a sequence of left-right moves starting from the root, stored as 0-1 bit patterns in the "long integer" (i.e. 0: left, 1: right)
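A minimal sketch of walking the tree according to such a bit pattern, reusing the illustrative SimpleNode type from the training sketch above; the bit order (lowest bit = first move from the root) is an assumption of this example:

// Walk from the root: a 0 bit means "go to the left daughter", a 1 bit means
// "go to the right daughter".
SimpleNode* GetNodeBySequence(SimpleNode* root, unsigned long sequence, unsigned int depth)
{
   SimpleNode* current = root;
   for (unsigned int i = 0; i < depth && current != nullptr; ++i) {
      const bool goRight = (sequence >> i) & 1UL;   // assumed: lowest bit = first move
      current = goRight ? current->right : current->left;
   }
   return current;
}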
helper function which determines the minimum and maximum values of the event variables in the current event sample
helper function which calculates the grid points used for the cut scan
decide how to split a node: at each node, ONE of the variables is chosen, namely the one which gives the best separation between signal and background on the sample that enters the node. In order to do this, for each variable a scan of the different cut values on a grid (of fNCuts points) is performed and the resulting separation gains are compared. This cut scan uses either a binary search tree or a simple loop over the events, depending on the number of events in the sample
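A minimal event-loop version of such a grid scan, in terms of the illustrative SimpleEvent/SimpleNode types from the training sketch above; the Gini-index-based separation gain used here is only one of the criteria TMVA offers, and the binary-search-tree variant is not shown:

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Gini index p*(1-p) of a (sub)sample, multiplied by its size; used here as one
// possible separation measure (TMVA supports several criteria via fSepType).
static double GiniTimesN(double nSig, double nBkg)
{
   const double n = nSig + nBkg;
   if (n <= 0.0) return 0.0;
   const double p = nSig / n;
   return n * p * (1.0 - p);
}

// Simple event-loop variant of the cut scan: for every variable, scan a regular
// grid of cut values and keep the variable/cut combination with the largest
// separation gain.  The grid size plays the role of fNCuts.
void FindBestSplit(const std::vector<SimpleEvent>& sample, SimpleNode* node)
{
   if (sample.empty()) return;
   const std::size_t nVars = sample.front().vars.size();
   const int nCuts = 20;   // number of grid points (fNCuts)

   double nSigTot = 0.0, nBkgTot = 0.0;
   for (const SimpleEvent& ev : sample) (ev.isSignal ? nSigTot : nBkgTot) += 1.0;
   const double parentIndex = GiniTimesN(nSigTot, nBkgTot);

   double bestGain = -std::numeric_limits<double>::max();
   for (std::size_t ivar = 0; ivar < nVars; ++ivar) {
      // scan range given by the minimum and maximum of this variable in the sample
      double xmin = sample.front().vars[ivar], xmax = xmin;
      for (const SimpleEvent& ev : sample) {
         xmin = std::min(xmin, ev.vars[ivar]);
         xmax = std::max(xmax, ev.vars[ivar]);
      }
      for (int icut = 1; icut < nCuts; ++icut) {
         const double cut = xmin + (xmax - xmin) * icut / nCuts;
         double nSigRight = 0.0, nBkgRight = 0.0;
         for (const SimpleEvent& ev : sample) {
            if (ev.vars[ivar] > cut) (ev.isSignal ? nSigRight : nBkgRight) += 1.0;
         }
         // separation gain = parent index - sum of the two daughter indices
         const double gain = parentIndex
                             - GiniTimesN(nSigRight, nBkgRight)
                             - GiniTimesN(nSigTot - nSigRight, nBkgTot - nBkgRight);
         if (gain > bestGain) {
            bestGain          = gain;
            node->cutVariable = static_cast<int>(ivar);
            node->cutValue    = cut;
         }
      }
   }
}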
the event e is put into the decision tree (starting at the root node) and the output is the NodeType (signal or background) of the final node (basket) in which the given event ends up, i.e. the result of the classification of the event by this decision tree.
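A hedged sketch of this classification step with the illustrative types from the training sketch above (ClassifyEvent is an illustrative name, not the TMVA method):

// Descend from the root, at each node following the right (signal-like) or left
// (background-like) daughter, until a leaf node (basket) is reached.
bool ClassifyEvent(const SimpleNode* root, const SimpleEvent& ev)
{
   const SimpleNode* current = root;
   while (current != nullptr && !current->isLeaf) {
      current = (ev.vars[current->cutVariable] > current->cutValue) ? current->right
                                                                    : current->left;
   }
   return (current != nullptr) && current->leafIsSignal;   // true: signal, false: background
}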
calculates the purity S/(S+B) of a given event sample
return the relative variable importance, normalized such that the importance of all variables together adds up to 1. The importance is evaluated as the total separation gain that this variable provided in the decision tree (weighted by the number of events)
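A minimal sketch of the final normalisation step; the function name and the assumption that the per-variable separation gains have already been accumulated are illustrative:

#include <numeric>
#include <vector>

// Normalise the per-variable separation gains (each accumulated over the nodes using
// that variable and weighted by the number of events at those nodes) so that all
// importances together add up to one.
std::vector<double> RelativeVariableImportance(std::vector<double> separationGains)
{
   const double total = std::accumulate(separationGains.begin(), separationGains.end(), 0.0);
   if (total > 0.0)
      for (double& g : separationGains) g /= total;
   return separationGains;
}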
{ fPruneMethod = m; }