Logo ROOT   6.14/05
Reference Guide
HybridStandardForm.C
Go to the documentation of this file.
1 /// \file
2 /// \ingroup tutorial_roostats
3 /// \notebook -js
4 /// A hypothesis testing example based on number counting with background uncertainty.
5 ///
6 /// A hypothesis testing example based on number counting
7 /// with background uncertainty.
8 ///
9 /// NOTE: This example is like HybridInstructional, but the model is more clearly
10 /// generalizable to an analysis with shapes. There is a lot of flexibility
11 /// for how one models a problem in RooFit/RooStats. Models come in a few
12 /// common forms:
13 /// - standard form: extended PDF of some discriminating variable m:
14 /// eg: P(m) ~ S*fs(m) + B*fb(m), with S+B events expected
15 /// in this case the dataset has N rows corresponding to N events
16 /// and the extended term is Pois(N|S+B)
17 ///
18 /// - fractional form: non-extended PDF of some discriminating variable m:
19 /// eg: P(m) ~ s*fs(m) + (1-s)*fb(m), where s is a signal fraction
20 /// in this case the dataset has N rows corresponding to N events
21 /// and there is no extended term
22 ///
23 /// - number counting form: in which there is no discriminating variable
24 /// and the counts are modeled directly (see HybridInstructional)
25 /// eg: P(N) = Pois(N|S+B)
26 /// in this case the dataset has 1 row corresponding to N events
27 /// and the extended term is the PDF itself.
28 ///
29 /// Here we convert the number counting form into the standard form by
30 /// introducing a dummy discriminating variable m with a uniform distribution.
31 ///
32 /// This example:
33 /// - demonstrates the usage of the HybridCalcultor (Part 4-6)
34 /// - demonstrates the numerical integration of RooFit (Part 2)
35 /// - validates the RooStats against an example with a known analytic answer
36 /// - demonstrates usage of different test statistics
37 /// - explains subtle choices in the prior used for hybrid methods
38 /// - demonstrates usage of different priors for the nuisance parameters
39 /// - demonstrates usage of PROOF
40 ///
41 /// The basic setup here is that a main measurement has observed x events with an
42 /// expectation of s+b. One can choose an ad hoc prior for the uncertainty on b,
43 /// or try to base it on an auxiliary measurement. In this case, the auxiliary
44 /// measurement (aka control measurement, sideband) is another counting experiment
45 /// with measurement y and expectation tau*b. With an 'original prior' on b,
46 /// called \f$ \eta(b) \f$ then one can obtain a posterior from the auxiliary measurement
47 /// \f$ \pi(b) = \eta(b) * Pois(y|tau*b) \f$. This is a principled choice for a prior
48 /// on b in the main measurement of x, which can then be treated in a hybrid
49 /// Bayesian/Frequentist way. Additionally, one can try to treat the two
50 /// measurements simultaneously, which is detailed in Part 6 of the tutorial.
51 ///
52 /// This tutorial is related to the FourBin.C tutorial in the modeling, but
53 /// focuses on hypothesis testing instead of interval estimation.
54 ///
55 /// More background on this 'prototype problem' can be found in the
56 /// following papers:
57 ///
58 /// - Evaluation of three methods for calculating statistical significance
59 /// when incorporating a systematic uncertainty into a test of the
60 /// background-only hypothesis for a Poisson process
61 /// Authors: Robert D. Cousins, James T. Linnemann, Jordan Tucker
62 /// http://arxiv.org/abs/physics/0702156
63 /// NIM A 595 (2008) 480--501
64 ///
65 /// - Statistical Challenges for Searches for New Physics at the LHC
66 /// Author: Kyle Cranmer
67 /// http://arxiv.org/abs/physics/0511028
68 ///
69 /// - Measures of Significance in HEP and Astrophysics
70 /// Author: J. T. Linnemann
71 /// http://arxiv.org/abs/physics/0312059
72 ///
73 /// \macro_image
74 /// \macro_output
75 /// \macro_code
76 ///
77 /// \authors Kyle Cranmer, Wouter Verkerke, and Sven Kreiss
78 
79 #include "RooGlobalFunc.h"
80 #include "RooRealVar.h"
81 #include "RooProdPdf.h"
82 #include "RooWorkspace.h"
83 #include "RooDataSet.h"
84 #include "RooDataHist.h"
85 #include "TCanvas.h"
86 #include "TStopwatch.h"
87 #include "TH1.h"
88 #include "RooPlot.h"
89 #include "RooMsgService.h"
90 
92 
94 #include "RooStats/ToyMCSampler.h"
95 #include "RooStats/HypoTestPlot.h"
96 
102 
103 using namespace RooFit;
104 using namespace RooStats;
105 
106 //-------------------------------------------------------
107 // A New Test Statistic Class for this example.
108 // It simply returns the sum of the values in a particular
109 // column of a dataset.
110 // You can ignore this class and focus on the macro below
111 
112 class BinCountTestStat : public TestStatistic {
113 public:
114  BinCountTestStat(void) : fColumnName("tmp") {}
115  BinCountTestStat(string columnName) : fColumnName(columnName) {}
116 
117  virtual Double_t Evaluate(RooAbsData& data, RooArgSet& /*nullPOI*/) {
118  // This is the main method in the interface
119  Double_t value = 0.0;
120  for(int i=0; i < data.numEntries(); i++) {
121  value += data.get(i)->getRealValue(fColumnName.c_str());
122  }
123  return value;
124  }
125  virtual const TString GetVarName() const { return fColumnName; }
126 
127 private:
128  string fColumnName;
129 
130 protected:
131  ClassDef(BinCountTestStat,1)
132 };
133 
134 ClassImp(BinCountTestStat)
135 
136 //-----------------------------
137 // The Actual Tutorial Macro
138 //-----------------------------
139 
140 void HybridStandardForm() {
141 
142  // This tutorial has 6 parts
143  // Table of Contents
144  // Setup
145  // 1. Make the model for the 'prototype problem'
146  // Special cases
147  // 2. NOT RELEVANT HERE
148  // 3. Use RooStats analytic solution for this problem
149  // RooStats HybridCalculator -- can be generalized
150  // 4. RooStats ToyMC version of 2. & 3.
151  // 5. RooStats ToyMC with an equivalent test statistic
152  // 6. RooStats ToyMC with simultaneous control & main measurement
153 
154  // Part 4 takes ~4 min without PROOF.
155  // Part 5 takes about ~2 min with PROOF on 4 cores.
156  // Of course, everything looks nicer with more toys, which takes longer.
157 
158 
159  TStopwatch t;
160  t.Start();
161  TCanvas *c = new TCanvas;
162  c->Divide(2,2);
163 
164  //-----------------------------------------------------
165  // P A R T 1 : D I R E C T I N T E G R A T I O N
166  // ====================================================
167  // Make model for prototype on/off problem
168  // Pois(x | s+b) * Pois(y | tau b )
169  // for Z_Gamma, use uniform prior on b.
170  RooWorkspace* w = new RooWorkspace("w");
171 
172 
173  // replace the pdf in 'number counting form'
174  //w->factory("Poisson::px(x[150,0,500],sum::splusb(s[0,0,100],b[100,0,300]))");
175  // with one in standard form. Now x is encoded in event count
176  w->factory("Uniform::f(m[0,1])");//m is a dummy discriminating variable
177  w->factory("ExtendPdf::px(f,sum::splusb(s[0,0,100],b[100,0,300]))");
178  w->factory("Poisson::py(y[100,0,500],prod::taub(tau[1.],b))");
179  w->factory("PROD::model(px,py)");
180  w->factory("Uniform::prior_b(b)");
181 
182  // We will control the output level in a few places to avoid
183  // verbose progress messages. We start by keeping track
184  // of the current threshold on messages.
186 
187  // Use PROOF-lite on multi-core machines
188  ProofConfig* pc = NULL;
189  // uncomment below if you want to use PROOF
190  pc = new ProofConfig(*w, 4, "workers=4", kFALSE); // machine with 4 cores
191  // pc = new ProofConfig(*w, 2, "workers=2", kFALSE); // machine with 2 cores
192 
193  //-----------------------------------------------
194  // P A R T 3 : A N A L Y T I C R E S U L T
195  // ==============================================
196  // In this special case, the integrals are known analytically
197  // and they are implemented in RooStats::NumberCountingUtils
198 
199  // analytic Z_Bi
200  double p_Bi = NumberCountingUtils::BinomialWithTauObsP(150, 100, 1);
201  double Z_Bi = NumberCountingUtils::BinomialWithTauObsZ(150, 100, 1);
202  cout << "-----------------------------------------"<<endl;
203  cout << "Part 3" << endl;
204  std::cout << "Z_Bi p-value (analytic): " << p_Bi << std::endl;
205  std::cout << "Z_Bi significance (analytic): " << Z_Bi << std::endl;
206  t.Stop(); t.Print(); t.Reset(); t.Start();
207 
208  //--------------------------------------------------------------
209  // P A R T 4 : U S I N G H Y B R I D C A L C U L A T O R
210  // ==============================================================
211  // Now we demonstrate the RooStats HybridCalculator.
212  //
213  // Like all RooStats calculators it needs the data and a ModelConfig
214  // for the relevant hypotheses. Since we are doing hypothesis testing
215  // we need a ModelConfig for the null (background only) and the alternate
216  // (signal+background) hypotheses. We also need to specify the PDF,
217  // the parameters of interest, and the observables. Furthermore, since
218  // the parameter of interest is floating, we need to specify which values
219  // of the parameter corresponds to the null and alternate (eg. s=0 and s=50)
220  //
221  // define some sets of variables obs={x} and poi={s}
222  // note here, x is the only observable in the main measurement
223  // and y is treated as a separate measurement, which is used
224  // to produce the prior that will be used in this calculation
225  // to randomize the nuisance parameters.
226  w->defineSet("obs","m");
227  w->defineSet("poi","s");
228 
229  // create a toy dataset with the x=150
230  // RooDataSet *data = new RooDataSet("d", "d", *w->set("obs"));
231  // data->add(*w->set("obs"));
232  RooDataSet* data = w->pdf("px")->generate(*w->set("obs"),150);
233 
234  // Part 3a : Setup ModelConfigs
235  //-------------------------------------------------------
236  // create the null (background-only) ModelConfig with s=0
237  ModelConfig b_model("B_model", w);
238  b_model.SetPdf(*w->pdf("px"));
239  b_model.SetObservables(*w->set("obs"));
240  b_model.SetParametersOfInterest(*w->set("poi"));
241  w->var("s")->setVal(0.0); // important!
242  b_model.SetSnapshot(*w->set("poi"));
243 
244  // create the alternate (signal+background) ModelConfig with s=50
245  ModelConfig sb_model("S+B_model", w);
246  sb_model.SetPdf(*w->pdf("px"));
247  sb_model.SetObservables(*w->set("obs"));
248  sb_model.SetParametersOfInterest(*w->set("poi"));
249  w->var("s")->setVal(50.0); // important!
250  sb_model.SetSnapshot(*w->set("poi"));
251 
252 
253  // Part 3b : Choose Test Statistic
254  //--------------------------------------------------------------
255  // To make an equivalent calculation we need to use x as the test
256  // statistic. This is not a built-in test statistic in RooStats
257  // so we define it above. The new class inherits from the
258  // RooStats::TestStatistic interface, and simply returns the value
259  // of x in the dataset.
260 
261  NumEventsTestStat eventCount(*w->pdf("px"));
262 
263  // Part 3c : Define Prior used to randomize nuisance parameters
264  //-------------------------------------------------------------
265  //
266  // The prior used for the hybrid calculator is the posterior
267  // from the auxiliary measurement y. The model for the aux.
268  // measurement is Pois(y|tau*b), thus the likelihood function
269  // is proportional to (has the form of) a Gamma distribution.
270  // if the 'original prior' $\eta(b)$ is uniform, then from
271  // Bayes's theorem we have the posterior:
272  // $\pi(b) = Pois(y|tau*b) * \eta(b)$
273  // If $\eta(b)$ is flat, then we arrive at a Gamma distribution.
274  // Since RooFit will normalize the PDF we can actually supply
275  // py=Pois(y,tau*b) that will be equivalent to multiplying by a uniform.
276  //
277  // Alternatively, we could explicitly use a gamma distribution:
278  //
279  // `w->factory("Gamma::gamma(b,sum::temp(y,1),1,0)");`
280  //
281  // or we can use some other ad hoc prior that do not naturally
282  // follow from the known form of the auxiliary measurement.
283  // The common choice is the equivalent Gaussian:
284  w->factory("Gaussian::gauss_prior(b,y, expr::sqrty('sqrt(y)',y))");
285  // this corresponds to the "Z_N" calculation.
286  //
287  // or one could use the analogous log-normal prior
288  w->factory("Lognormal::lognorm_prior(b,y, expr::kappa('1+1./sqrt(y)',y))");
289  //
290  // Ideally, the HybridCalculator would be able to inspect the full
291  // model Pois(x | s+b) * Pois(y | tau b ) and be given the original
292  // prior $\eta(b)$ to form $\pi(b) = Pois(y|tau*b) * \eta(b)$.
293  // This is not yet implemented because in the general case
294  // it is not easy to identify the terms in the PDF that correspond
295  // to the auxiliary measurement. So for now, it must be set
296  // explicitly with:
297  // - ForcePriorNuisanceNull()
298  // - ForcePriorNuisanceAlt()
299  // the name "ForcePriorNuisance" was chosen because we anticipate
300  // this to be auto-detected, but will leave the option open
301  // to force to a different prior for the nuisance parameters.
302 
303  // Part 3d : Construct and configure the HybridCalculator
304  //-------------------------------------------------------
305 
306  HybridCalculator hc1(*data, sb_model, b_model);
307  ToyMCSampler *toymcs1 = (ToyMCSampler*)hc1.GetTestStatSampler();
308  // toymcs1->SetNEventsPerToy(1); // because the model is in number counting form
309  toymcs1->SetTestStatistic(&eventCount); // set the test statistic
310  // toymcs1->SetGenerateBinned();
311  hc1.SetToys(30000,1000);
312  hc1.ForcePriorNuisanceAlt(*w->pdf("py"));
313  hc1.ForcePriorNuisanceNull(*w->pdf("py"));
314  // if you wanted to use the ad hoc Gaussian prior instead
315  // ~~~
316  // hc1.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
317  // hc1.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
318  // ~~~
319  // if you wanted to use the ad hoc log-normal prior instead
320  // ~~~
321  // hc1.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
322  // hc1.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
323  // ~~~
324 
325  // enable proof
326  // proof not enabled for this test statistic
327  // if(pc) toymcs1->SetProofConfig(pc);
328 
329  // these lines save current msg level and then kill any messages below ERROR
331  // Get the result
332  HypoTestResult *r1 = hc1.GetHypoTest();
333  RooMsgService::instance().setGlobalKillBelow(msglevel); // set it back
334  cout << "-----------------------------------------"<<endl;
335  cout << "Part 4" << endl;
336  r1->Print();
337  t.Stop(); t.Print(); t.Reset(); t.Start();
338 
339  c->cd(2);
340  HypoTestPlot *p1 = new HypoTestPlot(*r1,30); // 30 bins, TS is discrete
341  p1->Draw();
342 
343  return; // keep the running time sort by default
344  //-------------------------------------------------------------------------
345  // # P A R T 5 : U S I N G H Y B R I D C A L C U L A T O R W I T H
346  // # A N A L T E R N A T I V E T E S T S T A T I S T I C
347  //
348  // A likelihood ratio test statistics should be 1-to-1 with the count x
349  // when the value of b is fixed in the likelihood. This is implemented
350  // by the SimpleLikelihoodRatioTestStat
351 
352  SimpleLikelihoodRatioTestStat slrts(*b_model.GetPdf(),*sb_model.GetPdf());
353  slrts.SetNullParameters(*b_model.GetSnapshot());
354  slrts.SetAltParameters(*sb_model.GetSnapshot());
355 
356  // HYBRID CALCULATOR
357  HybridCalculator hc2(*data, sb_model, b_model);
358  ToyMCSampler *toymcs2 = (ToyMCSampler*)hc2.GetTestStatSampler();
359  // toymcs2->SetNEventsPerToy(1);
360  toymcs2->SetTestStatistic(&slrts);
361  // toymcs2->SetGenerateBinned();
362  hc2.SetToys(20000,1000);
363  hc2.ForcePriorNuisanceAlt(*w->pdf("py"));
364  hc2.ForcePriorNuisanceNull(*w->pdf("py"));
365  // if you wanted to use the ad hoc Gaussian prior instead
366  // ~~~
367  // hc2.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
368  // hc2.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
369  // ~~~
370  // if you wanted to use the ad hoc log-normal prior instead
371  // ~~~
372  // hc2.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
373  // hc2.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
374  // ~~~
375 
376  // enable proof
377  if(pc) toymcs2->SetProofConfig(pc);
378 
379  // these lines save current msg level and then kill any messages below ERROR
381  // Get the result
382  HypoTestResult *r2 = hc2.GetHypoTest();
383  cout << "-----------------------------------------"<<endl;
384  cout << "Part 5" << endl;
385  r2->Print();
386  t.Stop(); t.Print(); t.Reset(); t.Start();
388 
389  c->cd(3);
390  HypoTestPlot *p2 = new HypoTestPlot(*r2,30); // 30 bins
391  p2->Draw();
392 
393  return; // so standard tutorial runs faster
394 
395  //---------------------------------------------
396  // OUTPUT W/O PROOF (2.66 GHz Intel Core i7)
397  // ============================================
398 
399  // -----------------------------------------
400  // Part 3
401  // Z_Bi p-value (analytic): 0.00094165
402  // Z_Bi significance (analytic): 3.10804
403  // Real time 0:00:00, CP time 0.610
404 
405  // Results HybridCalculator_result:
406  // - Null p-value = 0.00103333 +/- 0.000179406
407  // - Significance = 3.08048 sigma
408  // - Number of S+B toys: 1000
409  // - Number of B toys: 30000
410  // - Test statistic evaluated on data: 150
411  // - CL_b: 0.998967 +/- 0.000185496
412  // - CL_s+b: 0.495 +/- 0.0158106
413  // - CL_s: 0.495512 +/- 0.0158272
414  // Real time 0:04:43, CP time 283.780
415 
416  // With PROOF
417  // -----------------------------------------
418  // Part 5
419 
420  // Results HybridCalculator_result:
421  // - Null p-value = 0.00105 +/- 0.000206022
422  // - Significance = 3.07571 sigma
423  // - Number of S+B toys: 1000
424  // - Number of B toys: 20000
425  // - Test statistic evaluated on data: 10.8198
426  // - CL_b: 0.99895 +/- 0.000229008
427  // - CL_s+b: 0.491 +/- 0.0158088
428  // - CL_s: 0.491516 +/- 0.0158258
429  // Real time 0:02:22, CP time 0.990
430 
431 
432  //-------------------------------------------------------
433  // Comparison
434  //-------------------------------------------------------
435  // LEPStatToolsForLHC
436  // https://plone4.fnal.gov:4430/P0/phystat/packages/0703002
437  // Uses Gaussian prior
438  // CL_b = 6.218476e-04, Significance = 3.228665 sigma
439  //
440  //-------------------------------------------------------
441  // Comparison
442  //-------------------------------------------------------
443  // Asymptotic
444  // From the value of the profile likelihood ratio (5.0338)
445  // The significance can be estimated using Wilks's theorem
446  // significance = sqrt(2*profileLR) = 3.1729 sigma
447 }
Holds configuration options for proof and proof-lite.
Definition: ProofConfig.h:46
ModelConfig is a simple class that holds configuration information specifying how a model should be u...
Definition: ModelConfig.h:30
void Start(Bool_t reset=kTRUE)
Start the stopwatch.
Definition: TStopwatch.cxx:58
void Print(Option_t *option="") const
Print the real and cpu time passed between the start and stop events.
Definition: TStopwatch.cxx:219
virtual const RooArgSet * get() const
Definition: RooAbsData.h:79
void Print(const Option_t *="") const
Print out some information about the results Note: use Alt/Null labels for the hypotheses here as the...
HypoTestResult is a base class for results from hypothesis tests.
TVirtualPad * cd(Int_t subpadnumber=0)
Set current canvas & pad.
Definition: TCanvas.cxx:688
RooFit::MsgLevel globalKillBelow() const
This class provides the plots for the result of a study performed with any of the HypoTestCalculatorG...
Definition: HypoTestPlot.h:22
static RooMsgService & instance()
Return reference to singleton instance.
Double_t BinomialWithTauObsP(Double_t nObs, Double_t bExp, Double_t tau)
virtual void SetTestStatistic(TestStatistic *testStatistic, unsigned int i)
Definition: ToyMCSampler.h:184
void Stop()
Stop the stopwatch.
Definition: TStopwatch.cxx:77
#define ClassDef(name, id)
Definition: Rtypes.h:320
void SetNullParameters(const RooArgSet &nullParameters)
static double p2(double t, double a, double b, double c)
void Draw(Option_t *options=0)
Draw this plot and all of the elements it contains.
Double_t BinomialWithTauObsZ(Double_t nObs, Double_t bExp, Double_t tau)
virtual void setVal(Double_t value)
Set value of variable to &#39;value&#39;.
Definition: RooRealVar.cxx:204
Bool_t defineSet(const char *name, const RooArgSet &aset, Bool_t importMissing=kFALSE)
Define a named RooArgSet with given constituents.
void SetProofConfig(ProofConfig *pc=NULL)
Definition: ToyMCSampler.h:233
void setGlobalKillBelow(RooFit::MsgLevel level)
RooAbsData is the common abstract base class for binned and unbinned datasets.
Definition: RooAbsData.h:37
static double p1(double t, double a, double b)
ToyMCSampler is an implementation of the TestStatSampler interface.
Definition: ToyMCSampler.h:71
RooDataSet is a container class to hold unbinned data.
Definition: RooDataSet.h:29
Same purpose as HybridCalculatorOriginal, but different implementation.
const Bool_t kFALSE
Definition: RtypesCore.h:88
The Canvas class.
Definition: TCanvas.h:31
Namespace for the RooStats classes.
Definition: Asimov.h:20
#define ClassImp(name)
Definition: Rtypes.h:359
RooAbsPdf * pdf(const char *name) const
Retrieve p.d.f (RooAbsPdf) with given name. A null pointer is returned if not found.
double Double_t
Definition: RtypesCore.h:55
TestStatistic class that returns -log(L[null] / L[alt]) where L is the likelihood.
RooRealVar * var(const char *name) const
Retrieve real-valued variable (RooRealVar) with given name. A null pointer is returned if not found...
RooFactoryWSTool & factory()
Return instance to factory tool.
virtual void Divide(Int_t nx=1, Int_t ny=1, Float_t xmargin=0.01, Float_t ymargin=0.01, Int_t color=0)
Automatic pad generation by division.
Definition: TPad.cxx:1162
Double_t getRealValue(const char *name, Double_t defVal=0, Bool_t verbose=kFALSE) const
Get value of a RooAbsReal stored in set with given name.
Definition: RooArgSet.cxx:526
RooDataSet * generate(const RooArgSet &whatVars, Int_t nEvents, const RooCmdArg &arg1, const RooCmdArg &arg2=RooCmdArg::none(), const RooCmdArg &arg3=RooCmdArg::none(), const RooCmdArg &arg4=RooCmdArg::none(), const RooCmdArg &arg5=RooCmdArg::none())
Generate a new dataset containing the specified variables with events sampled from our distribution...
Definition: RooAbsPdf.cxx:1725
const RooArgSet * set(const char *name)
Return pointer to previously defined named set with given nmame If no such set is found a null pointe...
#define c(i)
Definition: RSha256.hxx:101
void Reset()
Definition: TStopwatch.h:52
static constexpr double pc
TestStatistic is an interface class to provide a facility for construction test statistics distributi...
Definition: TestStatistic.h:31
The RooWorkspace is a persistable container for RooFit projects.
Definition: RooWorkspace.h:43
virtual Int_t numEntries() const
Definition: RooAbsData.cxx:285
Stopwatch class.
Definition: TStopwatch.h:28
NumEventsTestStat is a simple implementation of the TestStatistic interface used for simple number co...