ROOT logo
ROOT » MATH » PHYSICS » TRobustEstimator

class TRobustEstimator: public TObject


  TRobustEstimator

 Minimum Covariance Determinant Estimator - a Fast Algorithm
 invented by Peter J.Rousseeuw and Katrien Van Dreissen
 "A Fast Algorithm for the Minimum covariance Determinant Estimator"
 Technometrics, August 1999, Vol.41, NO.3

 What are robust estimators?
 "An important property of an estimator is its robustness. An estimator
 is called robust if it is insensitive to measurements that deviate
 from the expected behaviour. There are 2 ways to treat such deviating
 measurements: one may either try to recongize them and then remove
 them from the data sample; or one may leave them in the sample, taking
 care that they do not influence the estimate unduly. In both cases robust
 estimators are needed...Robust procedures compensate for systematic errors
 as much as possible, and indicate any situation in which a danger of not being
 able to operate reliably is detected."
 R.Fruhwirth, M.Regler, R.K.Bock, H.Grote, D.Notz
 "Data Analysis Techniques for High-Energy Physics", 2nd edition

 What does this algorithm do?
 It computes a highly robust estimator of multivariate location and scatter.
 Then, it takes those estimates to compute robust distances of all the
 data vectors. Those with large robust distances are considered outliers.
 Robust distances can then be plotted for better visualization of the data.

 How does this algorithm do it?
 The MCD objective is to find h observations(out of n) whose classical
 covariance matrix has the lowest determinant. The MCD estimator of location
 is then the average of those h points and the MCD estimate of scatter
 is their covariance matrix. The minimum(and default) h = (n+nvariables+1)/2
 so the algorithm is effective when less than (n+nvar+1)/2 variables are outliers.
 The algorithm also allows for exact fit situations - that is, when h or more
 observations lie on a hyperplane. Then the algorithm still yields the MCD location T
 and scatter matrix S, the latter being singular as it should be. From (T,S) the
 program then computes the equation of the hyperplane.

 How can this algorithm be used?
 In any case, when contamination of data is suspected, that might influence
 the classical estimates.
 Also, robust estimation of location and scatter is a tool to robustify
 other multivariate techniques such as, for example, principal-component analysis
 and discriminant analysis.




 Technical details of the algorithm:
 0.The default h = (n+nvariables+1)/2, but the user may choose any interger h with
   (n+nvariables+1)/2<=h<=n. The program then reports the MCD's breakdown value
   (n-h+1)/n. If you are sure that the dataset contains less than 25% contamination
   which is usually the case, a good compromise between breakdown value and
  efficiency is obtained by putting h=[.75*n].
 1.If h=n,the MCD location estimate is the average of the whole dataset, and
   the MCD scatter estimate is its covariance matrix. Report this and stop
 2.If nvariables=1 (univariate data), compute the MCD estimate by the exact
   algorithm of Rousseeuw and Leroy (1987, pp.171-172) in O(nlogn)time and stop
 3.From here on, h<n and nvariables>=2.
   3a.If n is small:
    - repeat (say) 500 times:
    -- construct an initial h-subset, starting from a random (nvar+1)-subset
    -- carry out 2 C-steps (described in the comments of CStep function)
    - for the 10 results with lowest det(S):
    -- carry out C-steps until convergence
    - report the solution (T, S) with the lowest det(S)
   3b.If n is larger (say, n>600), then
    - construct up to 5 disjoint random subsets of size nsub (say, nsub=300)
    - inside each subset repeat 500/5 times:
    -- construct an initial subset of size hsub=[nsub*h/n]
    -- carry out 2 C-steps
    -- keep the best 10 results (Tsub, Ssub)
    - pool the subsets, yielding the merged set (say, of size nmerged=1500)
    - in the merged set, repeat for each of the 50 solutions (Tsub, Ssub)
    -- carry out 2 C-steps
    -- keep the 10 best results
    - in the full dataset, repeat for those best results:
    -- take several C-steps, using n and h
    -- report the best final result (T, S)
 4.To obtain consistency when the data comes from a multivariate normal
   distribution, covariance matrix is multiplied by a correction factor
 5.Robust distances for all elements, using the final (T, S) are calculated
   Then the very final mean and covariance estimates are calculated only for
   values, whose robust distances are less than a cutoff value (0.975 quantile
   of chi2 distribution with nvariables degrees of freedom)


Function Members (Methods)

public:
TRobustEstimator()
TRobustEstimator(const TRobustEstimator&)
TRobustEstimator(Int_t nvectors, Int_t nvariables, Int_t hh = 0)
virtual~TRobustEstimator()
voidTObject::AbstractMethod(const char* method) const
voidAddColumn(Double_t* col)
voidAddRow(Double_t* row)
virtual voidTObject::AppendPad(Option_t* option = "")
virtual voidTObject::Browse(TBrowser* b)
static TClass*Class()
virtual const char*TObject::ClassName() const
virtual voidTObject::Clear(Option_t* = "")
virtual TObject*TObject::Clone(const char* newname = "") const
virtual Int_tTObject::Compare(const TObject* obj) const
virtual voidTObject::Copy(TObject& object) const
virtual voidTObject::Delete(Option_t* option = "")MENU
virtual Int_tTObject::DistancetoPrimitive(Int_t px, Int_t py)
virtual voidTObject::Draw(Option_t* option = "")
virtual voidTObject::DrawClass() constMENU
virtual TObject*TObject::DrawClone(Option_t* option = "") constMENU
virtual voidTObject::Dump() constMENU
virtual voidTObject::Error(const char* method, const char* msgfmt) const
voidEvaluate()
voidEvaluateUni(Int_t nvectors, Double_t* data, Double_t& mean, Double_t& sigma, Int_t hh = 0)
virtual voidTObject::Execute(const char* method, const char* params, Int_t* error = 0)
virtual voidTObject::Execute(TMethod* method, TObjArray* params, Int_t* error = 0)
virtual voidTObject::ExecuteEvent(Int_t event, Int_t px, Int_t py)
virtual voidTObject::Fatal(const char* method, const char* msgfmt) const
virtual TObject*TObject::FindObject(const char* name) const
virtual TObject*TObject::FindObject(const TObject* obj) const
Int_tGetBDPoint()
Double_tGetChiQuant(Int_t i) const
const TMatrixDSym*GetCorrelation() const
voidGetCorrelation(TMatrixDSym& matr)
const TMatrixDSym*GetCovariance() const
voidGetCovariance(TMatrixDSym& matr)
virtual Option_t*TObject::GetDrawOption() const
static Long_tTObject::GetDtorOnly()
const TVectorD*GetHyperplane() const
voidGetHyperplane(TVectorD& vec)
virtual const char*TObject::GetIconName() const
const TVectorD*GetMean() const
voidGetMean(TVectorD& means)
virtual const char*TObject::GetName() const
Int_tGetNHyp()
Int_tGetNOut()
Int_tGetNumberObservations() const
Int_tGetNvar() const
virtual char*TObject::GetObjectInfo(Int_t px, Int_t py) const
static Bool_tTObject::GetObjectStat()
virtual Option_t*TObject::GetOption() const
const TArrayI*GetOuliers() const
const TVectorD*GetRDistances() const
voidGetRDistances(TVectorD& rdist)
virtual const char*TObject::GetTitle() const
virtual UInt_tTObject::GetUniqueID() const
virtual Bool_tTObject::HandleTimer(TTimer* timer)
virtual ULong_tTObject::Hash() const
virtual voidTObject::Info(const char* method, const char* msgfmt) const
virtual Bool_tTObject::InheritsFrom(const char* classname) const
virtual Bool_tTObject::InheritsFrom(const TClass* cl) const
virtual voidTObject::Inspect() constMENU
voidTObject::InvertBit(UInt_t f)
virtual TClass*IsA() const
virtual Bool_tTObject::IsEqual(const TObject* obj) const
virtual Bool_tTObject::IsFolder() const
Bool_tTObject::IsOnHeap() const
virtual Bool_tTObject::IsSortable() const
Bool_tTObject::IsZombie() const
virtual voidTObject::ls(Option_t* option = "") const
voidTObject::MayNotUse(const char* method) const
virtual Bool_tTObject::Notify()
static voidTObject::operator delete(void* ptr)
static voidTObject::operator delete(void* ptr, void* vp)
static voidTObject::operator delete[](void* ptr)
static voidTObject::operator delete[](void* ptr, void* vp)
void*TObject::operator new(size_t sz)
void*TObject::operator new(size_t sz, void* vp)
void*TObject::operator new[](size_t sz)
void*TObject::operator new[](size_t sz, void* vp)
TRobustEstimator&operator=(const TRobustEstimator&)
virtual voidTObject::Paint(Option_t* option = "")
virtual voidTObject::Pop()
virtual voidTObject::Print(Option_t* option = "") const
virtual Int_tTObject::Read(const char* name)
virtual voidTObject::RecursiveRemove(TObject* obj)
voidTObject::ResetBit(UInt_t f)
virtual voidTObject::SaveAs(const char* filename = "", Option_t* option = "") constMENU
virtual voidTObject::SavePrimitive(basic_ostream<char,char_traits<char> >& out, Option_t* option = "")
voidTObject::SetBit(UInt_t f)
voidTObject::SetBit(UInt_t f, Bool_t set)
virtual voidTObject::SetDrawOption(Option_t* option = "")MENU
static voidTObject::SetDtorOnly(void* obj)
static voidTObject::SetObjectStat(Bool_t stat)
virtual voidTObject::SetUniqueID(UInt_t uid)
virtual voidShowMembers(TMemberInspector& insp, char* parent)
virtual voidStreamer(TBuffer& b)
voidStreamerNVirtual(TBuffer& b)
virtual voidTObject::SysError(const char* method, const char* msgfmt) const
Bool_tTObject::TestBit(UInt_t f) const
Int_tTObject::TestBits(UInt_t f) const
virtual voidTObject::UseCurrentStyle()
virtual voidTObject::Warning(const char* method, const char* msgfmt) const
virtual Int_tTObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0)
virtual Int_tTObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0) const
protected:
voidAddToSscp(TMatrixD& sscp, TVectorD& vec)
voidClassic()
voidClearSscp(TMatrixD& sscp)
voidCorrel()
voidCovar(TMatrixD& sscp, TVectorD& m, TMatrixDSym& cov, TVectorD& sd, Int_t nvec)
voidCreateOrtSubset(TMatrixD& dat, Int_t* index, Int_t hmerged, Int_t nmerged, TMatrixD& sscp, Double_t* ndist)
voidCreateSubset(Int_t ntotal, Int_t htotal, Int_t p, Int_t* index, TMatrixD& data, TMatrixD& sscp, Double_t* ndist)
Double_tCStep(Int_t ntotal, Int_t htotal, Int_t* index, TMatrixD& data, TMatrixD& sscp, Double_t* ndist)
virtual voidTObject::DoError(int level, const char* location, const char* fmt, va_list va) const
Int_tExact(Double_t* ndist)
Int_tExact2(TMatrixD& mstockbig, TMatrixD& cstockbig, TMatrixD& hyperplane, Double_t* deti, Int_t nbest, Int_t kgroup, TMatrixD& sscp, Double_t* ndist)
Double_tKOrdStat(Int_t ntotal, Double_t* arr, Int_t k, Int_t* work)
voidTObject::MakeZombie()
Int_tPartition(Int_t nmini, Int_t* indsubdat)
Int_tRDist(TMatrixD& sscp)
voidRDraw(Int_t* subdat, Int_t ngroup, Int_t* indsubdat)

Data Members

protected:
TMatrixDSymfCorrelationcorrelation matrix
TMatrixDSymfCovariancecovariance matrix estimate
TMatrixDfDatathe original data
Int_tfExactif there was an exact fit, stores the number of points on a hyperplane
Int_tfHalgorithm parameter, determining the subsample size
TVectorDfHyperplanein case more than fH observations lie on a hyperplane
TMatrixDSymfInvcovarianceinverse of the covariance matrix
TVectorDfMeanlocation estimate (mean values)
Int_tfNnumber of observations
Int_tfNvarnumber of variables
TArrayIfOutarray of indexes of ouliers, size <0.5*n
TVectorDfRdarray of robust distances, size n
TVectorDfSdarray of standard deviations
Int_tfVarTempnumber of variables already added to the data matrix
Int_tfVecTempnumber of observations already added to the data matrix

Class Charts

Inheritance Inherited Members Includes Libraries
Class Charts

Function documentation

TRobustEstimator()
this constructor should be used in a univariate case:
first call this constructor, then - the EvaluateUni(..) function
TRobustEstimator(Int_t nvectors, Int_t nvariables, Int_t hh = 0)
constructor
void AddColumn(Double_t* col)
adds a column to the data matrix
it is assumed that the column has size fN
variable fVarTemp keeps the number of columns l
already added
void AddRow(Double_t* row)
adds a vector to the data matrix
it is supposed that the vector is of size fNvar
void Evaluate()
Finds the estimate of multivariate mean and variance
void EvaluateUni(Int_t nvectors, Double_t* data, Double_t& mean, Double_t& sigma, Int_t hh = 0)
for the univariate case
estimates of location and scatter are returned in mean and sigma parameters
the algorithm works on the same principle as in multivariate case -
it finds a subset of size hh with smallest sigma, and then returns mean and
sigma of this subset
Int_t GetBDPoint()
returns the breakdown point of the algorithm
Double_t GetChiQuant(Int_t i) const
returns the chi2 quantiles
void GetCovariance(TMatrixDSym& matr)
returns the covariance matrix
void GetCorrelation(TMatrixDSym& matr)
returns the correlation matrix
const TVectorD* GetHyperplane() const
if the points are on a hyperplane, returns this hyperplane
void GetHyperplane(TVectorD& vec)
if the points are on a hyperplane, returns this hyperplane
void GetMean(TVectorD& means)
return the estimate of the mean
void GetRDistances(TVectorD& rdist)
returns the robust distances (helps to find outliers)
Int_t GetNOut()
returns the number of outliers
void AddToSscp(TMatrixD& sscp, TVectorD& vec)
update the sscp matrix with vector vec
void ClearSscp(TMatrixD& sscp)
clear the sscp matrix, used for covariance and mean calculation
void Classic()
called when h=n. Returns classic covariance matrix
and mean
void Covar(TMatrixD& sscp, TVectorD& m, TMatrixDSym& cov, TVectorD& sd, Int_t nvec)
calculates mean and covariance
void Correl()
transforms covariance matrix into correlation matrix
void CreateSubset(Int_t ntotal, Int_t htotal, Int_t p, Int_t* index, TMatrixD& data, TMatrixD& sscp, Double_t* ndist)
creates a subset of htotal elements from ntotal elements
first, p+1 elements are drawn randomly(without repetitions)
if their covariance matrix is singular, more elements are
added one by one, until their covariance matrix becomes regular
or it becomes clear that htotal observations lie on a hyperplane
If covariance matrix determinant!=0, distances of all ntotal elements
are calculated, using formula d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where
M is mean and S_inv is the inverse of the covariance matrix
htotal points with smallest distances are included in the returned subset.
void CreateOrtSubset(TMatrixD& dat, Int_t* index, Int_t hmerged, Int_t nmerged, TMatrixD& sscp, Double_t* ndist)
creates a subset of hmerged vectors with smallest orthogonal distances to the hyperplane
hyp[1]*(x1-mean[1])+...+hyp[nvar]*(xnvar-mean[nvar])=0
This function is called in case when less than fH samples lie on a hyperplane.
Double_t CStep(Int_t ntotal, Int_t htotal, Int_t* index, TMatrixD& data, TMatrixD& sscp, Double_t* ndist)
from the input htotal-subset constructs another htotal subset with lower determinant

As proven by Peter J.Rousseeuw and Katrien Van Driessen, if distances for all elements
are calculated, using the formula:d_i=Sqrt((x_i-M)*S_inv*(x_i-M)), where M is the mean
of the input htotal-subset, and S_inv - the inverse of its covariance matrix, then
htotal elements with smallest distances will have covariance matrix with determinant
less or equal to the determinant of the input subset covariance matrix.

determinant for this htotal-subset with smallest distances is returned
Int_t Exact(Double_t* ndist)
for the exact fit situaions
returns number of observations on the hyperplane
Int_t Exact2(TMatrixD& mstockbig, TMatrixD& cstockbig, TMatrixD& hyperplane, Double_t* deti, Int_t nbest, Int_t kgroup, TMatrixD& sscp, Double_t* ndist)
This function is called if determinant of the covariance matrix of a subset=0.

If there are more then fH vectors on a hyperplane,
returns this hyperplane and stops
else stores the hyperplane coordinates in hyperplane matrix
Int_t Partition(Int_t nmini, Int_t* indsubdat)
divides the elements into approximately equal subgroups
number of elements in each subgroup is stored in indsubdat
number of subgroups is returned
Int_t RDist(TMatrixD& sscp)
Calculates robust distances.Then the samples with robust distances
greater than a cutoff value (0.975 quantile of chi2 distribution with
fNvar degrees of freedom, multiplied by a correction factor), are given
weiht=0, and new, reweighted estimates of location and scatter are calculated
The function returns the number of outliers.
void RDraw(Int_t* subdat, Int_t ngroup, Int_t* indsubdat)
Draws ngroup nonoverlapping subdatasets out of a dataset of size n
such that the selected case numbers are uniformly distributed from 1 to n
Double_t KOrdStat(Int_t ntotal, Double_t* arr, Int_t k, Int_t* work)
because I need an Int_t work array
TRobustEstimator()
virtual ~TRobustEstimator()
{;}
void GetCovariance(TMatrixDSym& matr)
void GetCorrelation(TMatrixDSym& matr)
Int_t GetNHyp()
{return fExact;}
void GetMean(TVectorD& means)
void GetRDistances(TVectorD& rdist)
Int_t GetNumberObservations() const
{return fN;}
Int_t GetNvar() const
{return fNvar;}
const TArrayI* GetOuliers() const
{return &fOut;}