Logo ROOT  
Reference Guide
 
Loading...
Searching...
No Matches
HybridStandardForm.C
Go to the documentation of this file.
1/// \file
2/// \ingroup tutorial_roostats
3/// \notebook -js
4/// A hypothesis testing example based on number counting with background uncertainty.
5///
6/// A hypothesis testing example based on number counting
7/// with background uncertainty.
8///
9/// NOTE: This example is like HybridInstructional, but the model is more clearly
10/// generalizable to an analysis with shapes. There is a lot of flexibility
11/// for how one models a problem in RooFit/RooStats. Models come in a few
12/// common forms:
13/// - standard form: extended PDF of some discriminating variable m:
14/// eg: P(m) ~ S*fs(m) + B*fb(m), with S+B events expected
15/// in this case the dataset has N rows corresponding to N events
16/// and the extended term is Pois(N|S+B)
17///
18/// - fractional form: non-extended PDF of some discriminating variable m:
19/// eg: P(m) ~ s*fs(m) + (1-s)*fb(m), where s is a signal fraction
20/// in this case the dataset has N rows corresponding to N events
21/// and there is no extended term
22///
23/// - number counting form: in which there is no discriminating variable
24/// and the counts are modeled directly (see HybridInstructional)
25/// eg: P(N) = Pois(N|S+B)
26/// in this case the dataset has 1 row corresponding to N events
27/// and the extended term is the PDF itself.
28///
29/// Here we convert the number counting form into the standard form by
30/// introducing a dummy discriminating variable m with a uniform distribution.
31///
32/// This example:
33/// - demonstrates the usage of the HybridCalcultor (Part 4-6)
34/// - demonstrates the numerical integration of RooFit (Part 2)
35/// - validates the RooStats against an example with a known analytic answer
36/// - demonstrates usage of different test statistics
37/// - explains subtle choices in the prior used for hybrid methods
38/// - demonstrates usage of different priors for the nuisance parameters
39/// - demonstrates usage of PROOF
40///
41/// The basic setup here is that a main measurement has observed x events with an
42/// expectation of s+b. One can choose an ad hoc prior for the uncertainty on b,
43/// or try to base it on an auxiliary measurement. In this case, the auxiliary
44/// measurement (aka control measurement, sideband) is another counting experiment
45/// with measurement y and expectation tau*b. With an 'original prior' on b,
46/// called \f$ \eta(b) \f$ then one can obtain a posterior from the auxiliary measurement
47/// \f$ \pi(b) = \eta(b) * Pois(y|tau*b) \f$. This is a principled choice for a prior
48/// on b in the main measurement of x, which can then be treated in a hybrid
49/// Bayesian/Frequentist way. Additionally, one can try to treat the two
50/// measurements simultaneously, which is detailed in Part 6 of the tutorial.
51///
52/// This tutorial is related to the FourBin.C tutorial in the modeling, but
53/// focuses on hypothesis testing instead of interval estimation.
54///
55/// More background on this 'prototype problem' can be found in the
56/// following papers:
57///
58/// - Evaluation of three methods for calculating statistical significance
59/// when incorporating a systematic uncertainty into a test of the
60/// background-only hypothesis for a Poisson process
61/// Authors: Robert D. Cousins, James T. Linnemann, Jordan Tucker
62/// http://arxiv.org/abs/physics/0702156
63/// NIM A 595 (2008) 480--501
64///
65/// - Statistical Challenges for Searches for New Physics at the LHC
66/// Author: Kyle Cranmer
67/// http://arxiv.org/abs/physics/0511028
68///
69/// - Measures of Significance in HEP and Astrophysics
70/// Author: J. T. Linnemann
71/// http://arxiv.org/abs/physics/0312059
72///
73/// \macro_image
74/// \macro_output
75/// \macro_code
76///
77/// \authors Kyle Cranmer, Wouter Verkerke, and Sven Kreiss
78
79#include "RooGlobalFunc.h"
80#include "RooRealVar.h"
81#include "RooProdPdf.h"
82#include "RooWorkspace.h"
83#include "RooDataSet.h"
84#include "RooDataHist.h"
85#include "TCanvas.h"
86#include "TStopwatch.h"
87#include "TH1.h"
88#include "RooPlot.h"
89#include "RooMsgService.h"
90
92
96
102
103using namespace RooFit;
104using namespace RooStats;
105
106//-------------------------------------------------------
107// A New Test Statistic Class for this example.
108// It simply returns the sum of the values in a particular
109// column of a dataset.
110// You can ignore this class and focus on the macro below
111
112class BinCountTestStat : public TestStatistic {
113public:
114 BinCountTestStat(void) : fColumnName("tmp") {}
115 BinCountTestStat(string columnName) : fColumnName(columnName) {}
116
117 virtual Double_t Evaluate(RooAbsData &data, RooArgSet & /*nullPOI*/)
118 {
119 // This is the main method in the interface
120 Double_t value = 0.0;
121 for (int i = 0; i < data.numEntries(); i++) {
122 value += data.get(i)->getRealValue(fColumnName.c_str());
123 }
124 return value;
125 }
126 virtual const TString GetVarName() const { return fColumnName; }
127
128private:
129 string fColumnName;
130
131protected:
132 ClassDef(BinCountTestStat, 1)
133};
134
135ClassImp(BinCountTestStat)
136
137 //-----------------------------
138 // The Actual Tutorial Macro
139 //-----------------------------
140
141 void HybridStandardForm()
142{
143
144 // This tutorial has 6 parts
145 // Table of Contents
146 // Setup
147 // 1. Make the model for the 'prototype problem'
148 // Special cases
149 // 2. NOT RELEVANT HERE
150 // 3. Use RooStats analytic solution for this problem
151 // RooStats HybridCalculator -- can be generalized
152 // 4. RooStats ToyMC version of 2. & 3.
153 // 5. RooStats ToyMC with an equivalent test statistic
154 // 6. RooStats ToyMC with simultaneous control & main measurement
155
156 // Part 4 takes ~4 min without PROOF.
157 // Part 5 takes about ~2 min with PROOF on 4 cores.
158 // Of course, everything looks nicer with more toys, which takes longer.
159
160 TStopwatch t;
161 t.Start();
162 TCanvas *c = new TCanvas;
163 c->Divide(2, 2);
164
165 //-----------------------------------------------------
166 // P A R T 1 : D I R E C T I N T E G R A T I O N
167 // ====================================================
168 // Make model for prototype on/off problem
169 // Pois(x | s+b) * Pois(y | tau b )
170 // for Z_Gamma, use uniform prior on b.
171 RooWorkspace *w = new RooWorkspace("w");
172
173 // replace the pdf in 'number counting form'
174 // w->factory("Poisson::px(x[150,0,500],sum::splusb(s[0,0,100],b[100,0,300]))");
175 // with one in standard form. Now x is encoded in event count
176 w->factory("Uniform::f(m[0,1])"); // m is a dummy discriminating variable
177 w->factory("ExtendPdf::px(f,sum::splusb(s[0,0,100],b[100,0,300]))");
178 w->factory("Poisson::py(y[100,0,500],prod::taub(tau[1.],b))");
179 w->factory("PROD::model(px,py)");
180 w->factory("Uniform::prior_b(b)");
181
182 // We will control the output level in a few places to avoid
183 // verbose progress messages. We start by keeping track
184 // of the current threshold on messages.
186
187 // Use PROOF-lite on multi-core machines
188 ProofConfig *pc = NULL;
189 // uncomment below if you want to use PROOF
190 pc = new ProofConfig(*w, 4, "workers=4", kFALSE); // machine with 4 cores
191 // pc = new ProofConfig(*w, 2, "workers=2", kFALSE); // machine with 2 cores
192
193 //-----------------------------------------------
194 // P A R T 3 : A N A L Y T I C R E S U L T
195 // ==============================================
196 // In this special case, the integrals are known analytically
197 // and they are implemented in RooStats::NumberCountingUtils
198
199 // analytic Z_Bi
200 double p_Bi = NumberCountingUtils::BinomialWithTauObsP(150, 100, 1);
201 double Z_Bi = NumberCountingUtils::BinomialWithTauObsZ(150, 100, 1);
202 cout << "-----------------------------------------" << endl;
203 cout << "Part 3" << endl;
204 std::cout << "Z_Bi p-value (analytic): " << p_Bi << std::endl;
205 std::cout << "Z_Bi significance (analytic): " << Z_Bi << std::endl;
206 t.Stop();
207 t.Print();
208 t.Reset();
209 t.Start();
210
211 //--------------------------------------------------------------
212 // P A R T 4 : U S I N G H Y B R I D C A L C U L A T O R
213 // ==============================================================
214 // Now we demonstrate the RooStats HybridCalculator.
215 //
216 // Like all RooStats calculators it needs the data and a ModelConfig
217 // for the relevant hypotheses. Since we are doing hypothesis testing
218 // we need a ModelConfig for the null (background only) and the alternate
219 // (signal+background) hypotheses. We also need to specify the PDF,
220 // the parameters of interest, and the observables. Furthermore, since
221 // the parameter of interest is floating, we need to specify which values
222 // of the parameter corresponds to the null and alternate (eg. s=0 and s=50)
223 //
224 // define some sets of variables obs={x} and poi={s}
225 // note here, x is the only observable in the main measurement
226 // and y is treated as a separate measurement, which is used
227 // to produce the prior that will be used in this calculation
228 // to randomize the nuisance parameters.
229 w->defineSet("obs", "m");
230 w->defineSet("poi", "s");
231
232 // create a toy dataset with the x=150
233 // RooDataSet *data = new RooDataSet("d", "d", *w->set("obs"));
234 // data->add(*w->set("obs"));
235 RooDataSet *data = w->pdf("px")->generate(*w->set("obs"), 150);
236
237 // Part 3a : Setup ModelConfigs
238 //-------------------------------------------------------
239 // create the null (background-only) ModelConfig with s=0
240 ModelConfig b_model("B_model", w);
241 b_model.SetPdf(*w->pdf("px"));
242 b_model.SetObservables(*w->set("obs"));
243 b_model.SetParametersOfInterest(*w->set("poi"));
244 w->var("s")->setVal(0.0); // important!
245 b_model.SetSnapshot(*w->set("poi"));
246
247 // create the alternate (signal+background) ModelConfig with s=50
248 ModelConfig sb_model("S+B_model", w);
249 sb_model.SetPdf(*w->pdf("px"));
250 sb_model.SetObservables(*w->set("obs"));
251 sb_model.SetParametersOfInterest(*w->set("poi"));
252 w->var("s")->setVal(50.0); // important!
253 sb_model.SetSnapshot(*w->set("poi"));
254
255 // Part 3b : Choose Test Statistic
256 //--------------------------------------------------------------
257 // To make an equivalent calculation we need to use x as the test
258 // statistic. This is not a built-in test statistic in RooStats
259 // so we define it above. The new class inherits from the
260 // RooStats::TestStatistic interface, and simply returns the value
261 // of x in the dataset.
262
263 NumEventsTestStat eventCount(*w->pdf("px"));
264
265 // Part 3c : Define Prior used to randomize nuisance parameters
266 //-------------------------------------------------------------
267 //
268 // The prior used for the hybrid calculator is the posterior
269 // from the auxiliary measurement y. The model for the aux.
270 // measurement is Pois(y|tau*b), thus the likelihood function
271 // is proportional to (has the form of) a Gamma distribution.
272 // if the 'original prior' $\eta(b)$ is uniform, then from
273 // Bayes's theorem we have the posterior:
274 // $\pi(b) = Pois(y|tau*b) * \eta(b)$
275 // If $\eta(b)$ is flat, then we arrive at a Gamma distribution.
276 // Since RooFit will normalize the PDF we can actually supply
277 // py=Pois(y,tau*b) that will be equivalent to multiplying by a uniform.
278 //
279 // Alternatively, we could explicitly use a gamma distribution:
280 //
281 // `w->factory("Gamma::gamma(b,sum::temp(y,1),1,0)");`
282 //
283 // or we can use some other ad hoc prior that do not naturally
284 // follow from the known form of the auxiliary measurement.
285 // The common choice is the equivalent Gaussian:
286 w->factory("Gaussian::gauss_prior(b,y, expr::sqrty('sqrt(y)',y))");
287 // this corresponds to the "Z_N" calculation.
288 //
289 // or one could use the analogous log-normal prior
290 w->factory("Lognormal::lognorm_prior(b,y, expr::kappa('1+1./sqrt(y)',y))");
291 //
292 // Ideally, the HybridCalculator would be able to inspect the full
293 // model Pois(x | s+b) * Pois(y | tau b ) and be given the original
294 // prior $\eta(b)$ to form $\pi(b) = Pois(y|tau*b) * \eta(b)$.
295 // This is not yet implemented because in the general case
296 // it is not easy to identify the terms in the PDF that correspond
297 // to the auxiliary measurement. So for now, it must be set
298 // explicitly with:
299 // - ForcePriorNuisanceNull()
300 // - ForcePriorNuisanceAlt()
301 // the name "ForcePriorNuisance" was chosen because we anticipate
302 // this to be auto-detected, but will leave the option open
303 // to force to a different prior for the nuisance parameters.
304
305 // Part 3d : Construct and configure the HybridCalculator
306 //-------------------------------------------------------
307
308 HybridCalculator hc1(*data, sb_model, b_model);
309 ToyMCSampler *toymcs1 = (ToyMCSampler *)hc1.GetTestStatSampler();
310 // toymcs1->SetNEventsPerToy(1); // because the model is in number counting form
311 toymcs1->SetTestStatistic(&eventCount); // set the test statistic
312 // toymcs1->SetGenerateBinned();
313 hc1.SetToys(30000, 1000);
314 hc1.ForcePriorNuisanceAlt(*w->pdf("py"));
315 hc1.ForcePriorNuisanceNull(*w->pdf("py"));
316 // if you wanted to use the ad hoc Gaussian prior instead
317 // ~~~
318 // hc1.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
319 // hc1.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
320 // ~~~
321 // if you wanted to use the ad hoc log-normal prior instead
322 // ~~~
323 // hc1.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
324 // hc1.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
325 // ~~~
326
327 // enable proof
328 // proof not enabled for this test statistic
329 // if(pc) toymcs1->SetProofConfig(pc);
330
331 // these lines save current msg level and then kill any messages below ERROR
333 // Get the result
334 HypoTestResult *r1 = hc1.GetHypoTest();
335 RooMsgService::instance().setGlobalKillBelow(msglevel); // set it back
336 cout << "-----------------------------------------" << endl;
337 cout << "Part 4" << endl;
338 r1->Print();
339 t.Stop();
340 t.Print();
341 t.Reset();
342 t.Start();
343
344 c->cd(2);
345 HypoTestPlot *p1 = new HypoTestPlot(*r1, 30); // 30 bins, TS is discrete
346 p1->Draw();
347
348 return; // keep the running time sort by default
349 //-------------------------------------------------------------------------
350 // # P A R T 5 : U S I N G H Y B R I D C A L C U L A T O R W I T H
351 // # A N A L T E R N A T I V E T E S T S T A T I S T I C
352 //
353 // A likelihood ratio test statistics should be 1-to-1 with the count x
354 // when the value of b is fixed in the likelihood. This is implemented
355 // by the SimpleLikelihoodRatioTestStat
356
357 SimpleLikelihoodRatioTestStat slrts(*b_model.GetPdf(), *sb_model.GetPdf());
358 slrts.SetNullParameters(*b_model.GetSnapshot());
359 slrts.SetAltParameters(*sb_model.GetSnapshot());
360
361 // HYBRID CALCULATOR
362 HybridCalculator hc2(*data, sb_model, b_model);
363 ToyMCSampler *toymcs2 = (ToyMCSampler *)hc2.GetTestStatSampler();
364 // toymcs2->SetNEventsPerToy(1);
365 toymcs2->SetTestStatistic(&slrts);
366 // toymcs2->SetGenerateBinned();
367 hc2.SetToys(20000, 1000);
368 hc2.ForcePriorNuisanceAlt(*w->pdf("py"));
369 hc2.ForcePriorNuisanceNull(*w->pdf("py"));
370 // if you wanted to use the ad hoc Gaussian prior instead
371 // ~~~
372 // hc2.ForcePriorNuisanceAlt(*w->pdf("gauss_prior"));
373 // hc2.ForcePriorNuisanceNull(*w->pdf("gauss_prior"));
374 // ~~~
375 // if you wanted to use the ad hoc log-normal prior instead
376 // ~~~
377 // hc2.ForcePriorNuisanceAlt(*w->pdf("lognorm_prior"));
378 // hc2.ForcePriorNuisanceNull(*w->pdf("lognorm_prior"));
379 // ~~~
380
381 // enable proof
382 if (pc)
383 toymcs2->SetProofConfig(pc);
384
385 // these lines save current msg level and then kill any messages below ERROR
387 // Get the result
388 HypoTestResult *r2 = hc2.GetHypoTest();
389 cout << "-----------------------------------------" << endl;
390 cout << "Part 5" << endl;
391 r2->Print();
392 t.Stop();
393 t.Print();
394 t.Reset();
395 t.Start();
397
398 c->cd(3);
399 HypoTestPlot *p2 = new HypoTestPlot(*r2, 30); // 30 bins
400 p2->Draw();
401
402 return; // so standard tutorial runs faster
403
404 //---------------------------------------------
405 // OUTPUT W/O PROOF (2.66 GHz Intel Core i7)
406 // ============================================
407
408 // -----------------------------------------
409 // Part 3
410 // Z_Bi p-value (analytic): 0.00094165
411 // Z_Bi significance (analytic): 3.10804
412 // Real time 0:00:00, CP time 0.610
413
414 // Results HybridCalculator_result:
415 // - Null p-value = 0.00103333 +/- 0.000179406
416 // - Significance = 3.08048 sigma
417 // - Number of S+B toys: 1000
418 // - Number of B toys: 30000
419 // - Test statistic evaluated on data: 150
420 // - CL_b: 0.998967 +/- 0.000185496
421 // - CL_s+b: 0.495 +/- 0.0158106
422 // - CL_s: 0.495512 +/- 0.0158272
423 // Real time 0:04:43, CP time 283.780
424
425 // With PROOF
426 // -----------------------------------------
427 // Part 5
428
429 // Results HybridCalculator_result:
430 // - Null p-value = 0.00105 +/- 0.000206022
431 // - Significance = 3.07571 sigma
432 // - Number of S+B toys: 1000
433 // - Number of B toys: 20000
434 // - Test statistic evaluated on data: 10.8198
435 // - CL_b: 0.99895 +/- 0.000229008
436 // - CL_s+b: 0.491 +/- 0.0158088
437 // - CL_s: 0.491516 +/- 0.0158258
438 // Real time 0:02:22, CP time 0.990
439
440 //-------------------------------------------------------
441 // Comparison
442 //-------------------------------------------------------
443 // LEPStatToolsForLHC
444 // https://plone4.fnal.gov:4430/P0/phystat/packages/0703002
445 // Uses Gaussian prior
446 // CL_b = 6.218476e-04, Significance = 3.228665 sigma
447 //
448 //-------------------------------------------------------
449 // Comparison
450 //-------------------------------------------------------
451 // Asymptotic
452 // From the value of the profile likelihood ratio (5.0338)
453 // The significance can be estimated using Wilks's theorem
454 // significance = sqrt(2*profileLR) = 3.1729 sigma
455}
#define c(i)
Definition RSha256.hxx:101
const Bool_t kFALSE
Definition RtypesCore.h:92
double Double_t
Definition RtypesCore.h:59
#define ClassDef(name, id)
Definition Rtypes.h:325
#define ClassImp(name)
Definition Rtypes.h:364
Double_t getRealValue(const char *name, Double_t defVal=0, Bool_t verbose=kFALSE) const
Get value of a RooAbsReal stored in set with given name.
RooAbsData is the common abstract base class for binned and unbinned datasets.
Definition RooAbsData.h:49
virtual const RooArgSet * get() const
Definition RooAbsData.h:92
virtual Int_t numEntries() const
Return number of entries in dataset, i.e., count unweighted entries.
RooDataSet * generate(const RooArgSet &whatVars, Int_t nEvents, const RooCmdArg &arg1, const RooCmdArg &arg2=RooCmdArg::none(), const RooCmdArg &arg3=RooCmdArg::none(), const RooCmdArg &arg4=RooCmdArg::none(), const RooCmdArg &arg5=RooCmdArg::none())
See RooAbsPdf::generate(const RooArgSet&,const RooCmdArg&,const RooCmdArg&,const RooCmdArg&,...
Definition RooAbsPdf.h:58
RooArgSet is a container object that can hold multiple RooAbsArg objects.
Definition RooArgSet.h:29
RooDataSet is a container class to hold unbinned data.
Definition RooDataSet.h:33
static RooMsgService & instance()
Return reference to singleton instance.
void setGlobalKillBelow(RooFit::MsgLevel level)
RooFit::MsgLevel globalKillBelow() const
virtual void setVal(Double_t value)
Set value of variable to 'value'.
Same purpose as HybridCalculatorOriginal, but different implementation.
This class provides the plots for the result of a study performed with any of the HypoTestCalculatorG...
HypoTestResult is a base class for results from hypothesis tests.
void Print(const Option_t *="") const
Print out some information about the results Note: use Alt/Null labels for the hypotheses here as the...
ModelConfig is a simple class that holds configuration information specifying how a model should be u...
Definition ModelConfig.h:30
NumEventsTestStat is a simple implementation of the TestStatistic interface used for simple number co...
Holds configuration options for proof and proof-lite.
Definition ProofConfig.h:46
void Draw(Option_t *options=0)
Draw this plot and all of the elements it contains.
TestStatistic class that returns -log(L[null] / L[alt]) where L is the likelihood.
TestStatistic is an interface class to provide a facility for construction test statistics distributi...
ToyMCSampler is an implementation of the TestStatSampler interface.
void SetProofConfig(ProofConfig *pc=NULL)
virtual void SetTestStatistic(TestStatistic *testStatistic, unsigned int i)
The RooWorkspace is a persistable container for RooFit projects.
Bool_t defineSet(const char *name, const RooArgSet &aset, Bool_t importMissing=kFALSE)
Define a named RooArgSet with given constituents.
RooRealVar * var(const char *name) const
Retrieve real-valued variable (RooRealVar) with given name. A null pointer is returned if not found.
RooFactoryWSTool & factory()
Return instance to factory tool.
const RooArgSet * set(const char *name)
Return pointer to previously defined named set with given nmame If no such set is found a null pointe...
RooAbsPdf * pdf(const char *name) const
Retrieve p.d.f (RooAbsPdf) with given name. A null pointer is returned if not found.
The Canvas class.
Definition TCanvas.h:23
void Divide(Int_t nx=1, Int_t ny=1, Float_t xmargin=0.01, Float_t ymargin=0.01, Int_t color=0) override
Automatic pad generation by division.
Definition TPad.cxx:1177
Stopwatch class.
Definition TStopwatch.h:28
void Start(Bool_t reset=kTRUE)
Start the stopwatch.
void Stop()
Stop the stopwatch.
void Reset()
Definition TStopwatch.h:52
void Print(Option_t *option="") const
Print the real and cpu time passed between the start and stop events.
Basic string class.
Definition TString.h:136
The namespace RooFit contains mostly switches that change the behaviour of functions of PDFs (or othe...
MsgLevel
Verbosity level for RooMsgService::StreamConfig in RooMsgService.
Namespace for the RooStats classes.
Definition Asimov.h:19