class TPrincipal: public TNamed

Principal Components Analysis (PCA)

The current implementation is based on the LINTRA package from CERNLIB by R. Brun, H. Hansroul, and J. Kubler. The class has been implemented by Christian Holm Christensen in August 2000.

Introduction

In many applications of various fields of research, the treatment of large amounts of data requires powerful techniques capable of rapid data reduction and analysis. Usually, the quantities most conveniently measured by the experimentalist are not necessarily the most significant for classification and analysis of the data. It is then useful to have a way of selecting an optimal set of variables necessary for the recognition process and reducing the dimensionality of the problem, resulting in an easier classification procedure.

This paper describes the implementation of one such method of feature selection, namely the principal components analysis. This multidimensional technique is well known in the field of pattern recognition, and its use in Particle Physics has been documented elsewhere (cf. H. Wind, Function Parameterization, CERN 72-21).

Overview

Suppose we have prototypes which are trajectories of particles, passing through a spectrometer. If one measures the passage of the particle at say 8 fixed planes, the trajectory is described by an 8-component vector:

$\begin{displaymath} \mathbf{x} = \left(x_0, x_1, \ldots, x_7\right) \end{displaymath}$

in 8-dimensional pattern space.

One proceeds by generating a representative sample of tracks and building up the covariance matrix $\mathsf{C}$. Its eigenvectors and eigenvalues are computed by standard methods, and thus a new basis is obtained for the original 8-dimensional space. The expansion of the prototypes,

$\begin{displaymath} \mathbf{x}_m = \sum^7_{i=0} a_{m_i} \mathbf{e}_i \quad \mbox{where} \quad a_{m_i} = \mathbf{x}^T\bullet\mathbf{e}_i \end{displaymath}$

allows the study of the behavior of the coefficients $a_{m_i}$ for all the tracks of the sample. The eigenvectors which are insignificant for the trajectory description in the expansion will have their corresponding coefficients $a_{m_i}$ close to zero for all the prototypes.

On one hand, a reduction of the dimensionality is then obtained by omitting these least significant vectors in the subsequent analysis.

On the other hand, in the analysis of real data, these least significant variables can be used for the pattern recognition problem of extracting the valid combinations of coordinates describing a true trajectory from the set of all possible wrong combinations.
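The criterion described above (coefficients close to zero for all the prototypes) can be sketched in a few lines of plain C++. This is only an illustration and not part of TPrincipal: the function name `insignificantComponents` and the absolute threshold `tol` are assumptions introduced here, and the text does not prescribe how "close to zero" should be quantified.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// coeffs[m][i] holds the coefficient a_{m_i} of prototype m along eigenvector i.
// Returns the indices i whose coefficient stays within +-tol for every prototype.
// Hypothetical helper: name and threshold are illustrative assumptions.
std::vector<std::size_t> insignificantComponents(
        const std::vector<std::vector<double>>& coeffs, double tol) {
    std::vector<std::size_t> result;
    if (coeffs.empty()) return result;
    const std::size_t nComponents = coeffs.front().size();
    for (std::size_t i = 0; i < nComponents; ++i) {
        bool small = true;
        for (const auto& row : coeffs) {
            assert(row.size() == nComponents);
            if (std::fabs(row[i]) > tol) { small = false; break; }
        }
        if (small) result.push_back(i);
    }
    return result;
}
```

The returned indices are the candidates for omission in a dimensionality reduction. In practice the threshold would have to be chosen relative to the scale of the data.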

The program described here performs this principal components analysis on a sample of data provided by the user. It computes the covariance matrix, its eigenvalues and corresponding eigenvectors, and exhibits the behavior of the principal components ( $a_{m_i}$ ), thus providing the user with all the means of understanding their data.

A short outline of the method of Principal Components is given in the section Principal Components Method below.

Principal Components Method

Let's consider a sample of $M$ prototypes, each being characterized by $P$ variables $x_0, x_1, \ldots, x_{P-1}$ . Each prototype is a point, or a column vector, in a $P$-dimensional pattern space.

$\begin{displaymath} \mathbf{x} = \left[\begin{array}{c} x_0\\ x_1\\ \vdots\\ x_{P-1}\end{array}\right]\,, \end{displaymath}$

(1)

where each $x_n$ represents the particular value associated with the $n$-th dimension.

Those variables are the quantities accessible to the experimentalist, but are not necessarily the most significant for the classification purpose.

The Principal Components Method consists of applying a linear transformation to the original variables. This transformation is described by an orthogonal matrix and is equivalent to a rotation of the original pattern space into a new set of coordinate vectors, which hopefully provide easier feature identification and dimensionality reduction.

Let's define the covariance matrix:

$\begin{displaymath} \mathsf{C} = \left\langle\mathbf{y}\mathbf{y}^T\right\rangle \quad\mbox{where}\quad \mathbf{y} = \mathbf{x} - \left\langle\mathbf{x}\right\rangle\,, \end{displaymath}$

(2)

and the brackets indicate the mean value over the sample of $M$ prototypes.
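As an illustration of definition (2), the following plain C++ sketch computes the mean vector and the covariance matrix of a small sample in two passes. It is not the TPrincipal implementation; the names `meanValues` and `covarianceMatrix` and the row-per-prototype layout are assumptions made for this example. The average is taken with a 1/M normalisation, matching the mean-value brackets in (2).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;   // row-major: Matrix[i][j]

// Mean of each of the P variables over the M prototypes (rows of `sample`).
Vector meanValues(const Matrix& sample) {
    assert(!sample.empty());
    const std::size_t P = sample.front().size();
    Vector mean(P, 0.0);
    for (const Vector& x : sample) {
        assert(x.size() == P);
        for (std::size_t i = 0; i < P; ++i) mean[i] += x[i];
    }
    for (double& m : mean) m /= static_cast<double>(sample.size());
    return mean;
}

// C = <y y^T> with y = x - <x>, averaged over the M prototypes.
Matrix covarianceMatrix(const Matrix& sample) {
    const Vector mean = meanValues(sample);
    const std::size_t P = mean.size();
    Matrix C(P, Vector(P, 0.0));
    for (const Vector& x : sample)
        for (std::size_t i = 0; i < P; ++i)
            for (std::size_t j = 0; j < P; ++j)
                C[i][j] += (x[i] - mean[i]) * (x[j] - mean[j]);
    for (Vector& row : C)
        for (double& c : row) c /= static_cast<double>(sample.size());
    return C;
}
```

For the three prototypes (0, 0), (2, 4) and (4, 2), the mean is (2, 2) and the diagonal elements of the covariance matrix are 8/3, with off-diagonal elements 4/3.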

This matrix $\mathsf{C}$ is real, positive definite, symmetric, and will have all its eigenvalues greater than zero. It will now be shown that among the family of all the complete orthonormal bases of the pattern space, the basis formed by the eigenvectors of the covariance matrix and belonging to the largest eigenvalues corresponds to the most significant features of the description of the original prototypes.

Let the prototypes be expanded into a set of basis vectors $\mathbf{e}_n, n=0,\ldots,N,N+1, \ldots, P-1$ ,

$\begin{displaymath} \mathbf{y}_i = \sum^N_{n=0} a_{i_n} \mathbf{e}_n, \quad i = 0, \ldots, M, \quad N < P-1 \end{displaymath}$

(3)

The `best' feature coordinates $\mathbf{e}_n$ , spanning a feature space, will be obtained by minimizing the error due to this truncated expansion, i.e.,

$\begin{displaymath} \min\left(E_N\right) = \min\left[\left\langle\left(\mathbf{y}_i - \sum^N_{n=0} a_{i_n} \mathbf{e}_n\right)^2\right\rangle\right] \end{displaymath}$

(4)

with the conditions:

$\begin{displaymath} \mathbf{e}_k\bullet\mathbf{e}_j = \delta_{jk} = \left\{\begin{array}{rcl} 1 & \mbox{for} & k = j\\ 0 & \mbox{for} & k \neq j \end{array}\right. \end{displaymath}$

(5)

Multiplying (3) by $\mathbf{e}^T_n$ and using (5), we get

$\begin{displaymath} a_{i_n} = \mathbf{y}_i^T\bullet\mathbf{e}_n\,, \end{displaymath}$

(6)

so the error becomes

$\begin{eqnarray} E_N & = & \left\langle\left[\sum_{n=N+1}^{P-1} a_{i_n}\mathbf{e}_n\right]^2\right\rangle \nonumber\\ & = & \left\langle\left[\sum_{n=N+1}^{P-1} \mathbf{y}_i^T\bullet\mathbf{e}_n\mathbf{e}_n\right]^2\right\rangle \nonumber\\ & = & \left\langle\sum_{n=N+1}^{P-1} \mathbf{e}_n^T\mathbf{y}_i\mathbf{y}_i^T\mathbf{e}_n\right\rangle \nonumber\\ & = & \sum_{n=N+1}^{P-1} \mathbf{e}_n^T\mathsf{C}\mathbf{e}_n \end{eqnarray}$

(7)

The minimization of the sum in (7) is obtained when each term $\mathbf{e}_n^T\mathsf{C}\mathbf{e}_n$ is minimal, since $\mathsf{C}$ is positive definite. By the method of Lagrange multipliers, and the condition (5), we get

$\begin{displaymath} E_N = \sum^{P-1}_{n=N+1} \left(\mathbf{e}_n^T\mathsf{C}\mathbf{e}_n - l_n\mathbf{e}_n^T\bullet\mathbf{e}_n + l_n\right) \end{displaymath}$

(8)

The minimum condition $\frac{dE_N}{d\mathbf{e}^T_n} = 0$ leads to the equation

$\begin{displaymath} \mathsf{C}\mathbf{e}_n = l_n\mathbf{e}_n\,, \end{displaymath}$

(9)

which shows that $\mathbf{e}_n$ is an eigenvector of the covariance matrix $\mathsf{C}$ with eigenvalue $l_n$. The estimated minimum error is then given by

$\begin{displaymath} E_N \sim \sum^{P-1}_{n=N+1} \mathbf{e}_n^T\bullet l_n\mathbf{e}_n = \sum^{P-1}_{n=N+1} l_n\,, \end{displaymath}$

(10)

where $l_n,\,n=N+1,\ldots,P-1$ are the eigenvalues associated with the omitted eigenvectors in the expansion (3). Thus, by choosing the $N$ largest eigenvalues, and their associated eigenvectors, the error $E_N$ is minimized.
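Result (10) can be checked numerically in the smallest non-trivial case, P = 2. The sketch below is plain C++ and independent of TPrincipal: it uses the closed-form eigen-decomposition of a symmetric 2x2 matrix (a textbook formula swapped in here for illustration, not the eigenvalue routine of the class) and computes the mean squared error made when every centred prototype is replaced by its projection on the leading eigenvector. The names `Eigen2`, `eigenSymmetric2x2` and `truncationError` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec2 = std::array<double, 2>;

struct Eigen2 {
    double l0, l1;   // eigenvalues, l0 >= l1
    Vec2 e0, e1;     // matching unit eigenvectors
};

// Closed-form eigen-decomposition of the symmetric matrix [[a, b], [b, d]].
Eigen2 eigenSymmetric2x2(double a, double b, double d) {
    const double half = 0.5 * (a + d);
    const double root = std::sqrt(0.25 * (a - d) * (a - d) + b * b);
    Eigen2 r;
    r.l0 = half + root;
    r.l1 = half - root;
    if (b == 0.0) {                       // matrix is already diagonal
        r.e0 = (a >= d) ? Vec2{1.0, 0.0} : Vec2{0.0, 1.0};
    } else {                              // (b, l0 - a) solves (C - l0) e = 0
        const double norm = std::hypot(b, r.l0 - a);
        r.e0 = {b / norm, (r.l0 - a) / norm};
    }
    r.e1 = {-r.e0[1], r.e0[0]};           // orthogonal complement in 2D
    return r;
}

// Mean squared error when each centred prototype y is replaced by its
// projection (y . e0) e0 onto the leading eigenvector only.
double truncationError(const std::vector<Vec2>& centred, const Vec2& e0) {
    double sum = 0.0;
    for (const Vec2& y : centred) {
        const double a0 = y[0] * e0[0] + y[1] * e0[1];   // coefficient, eq. (6)
        const double r0 = y[0] - a0 * e0[0];
        const double r1 = y[1] - a0 * e0[1];
        sum += r0 * r0 + r1 * r1;
    }
    return sum / static_cast<double>(centred.size());
}
```

For the centred prototypes (-2, -2), (0, 2) and (2, 0), the covariance matrix has elements 8/3 on the diagonal and 4/3 off the diagonal, the eigenvalues are 4 and 4/3, and the truncation error obtained by keeping only the leading eigenvector equals the omitted eigenvalue 4/3, as (10) states.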

The transformation matrix to go from the pattern space to the feature space consists of the ordered eigenvectors $\mathbf{e}_0,\ldots,\mathbf{e}_{P-1}$ for its columns

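$\begin{displaymath} \mathsf{T} = \left[ \begin{array}{cccc} \mathbf{e}_0 & \mathbf{e}_1 & \cdots & \mathbf{e}_{P-1} \end{array}\right] = \left[ \begin{array}{cccc} \mathbf{e}_{0_0} & \mathbf{e}_{1_0} & \cdots & \mathbf{e}_{{P-1}_0}\\ \mathbf{e}_{0_1} & \mathbf{e}_{1_1} & \cdots & \mathbf{e}_{{P-1}_1}\\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{e}_{0_{P-1}} & \mathbf{e}_{1_{P-1}} & \cdots & \mathbf{e}_{{P-1}_{P-1}}\\ \end{array}\right] \end{displaymath}$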

(11)

This is an orthogonal transformation, or rotation, of the pattern space and feature selection results in ignoring certain coordinates in the transformed space.
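The rotation described above, and the effect of ignoring coordinates in the transformed space, can be sketched in plain C++ as follows. This is a hedged illustration rather than the code of the class: the function names `toFeatureSpace` and `toPatternSpace`, the storage convention `T[j][n]` (component j of eigenvector n, so that eigenvectors are the columns) and the parameter `nKeep` are assumptions introduced for this example. The mean is subtracted first because the expansion (3) is written for the centred vectors y.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;   // T[j][n]: component j of eigenvector n

// Pattern space -> feature space: p_n = e_n^T (x - mean).
Vector toFeatureSpace(const Matrix& T, const Vector& mean, const Vector& x) {
    const std::size_t P = x.size();
    assert(T.size() == P && mean.size() == P);
    Vector p(P, 0.0);
    for (std::size_t n = 0; n < P; ++n)
        for (std::size_t j = 0; j < P; ++j)
            p[n] += T[j][n] * (x[j] - mean[j]);
    return p;
}

// Feature space -> pattern space, keeping only the first nKeep components.
// With nKeep == P the original vector is recovered; with nKeep < P the
// trailing coordinates are ignored (feature selection).
Vector toPatternSpace(const Matrix& T, const Vector& mean, const Vector& p,
                      std::size_t nKeep) {
    const std::size_t P = mean.size();
    assert(T.size() == P && p.size() == P && nKeep <= P);
    Vector x(mean);
    for (std::size_t n = 0; n < nKeep; ++n)
        for (std::size_t j = 0; j < P; ++j)
            x[j] += T[j][n] * p[n];
    return x;
}
```

With eigenvectors (1, 1)/sqrt(2) and (-1, 1)/sqrt(2), mean (2, 2) and x = (4, 2), the full back-transformation returns (4, 2), while keeping only the first component returns (3, 3), the point on the leading axis closest to x.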

Christian Holm
August 2000, CERN

Function Members (Methods)

public:

	TPrincipal()
	TPrincipal(Int_t nVariables, Option_t* opt = "ND")
virtual	~TPrincipal()
void	TObject::AbstractMethod(const char* method) const
virtual void	AddRow(const Double_t* x)
virtual void	TObject::AppendPad(Option_t* option = "")
virtual void	Browse(TBrowser* b)
static TClass*	Class()
virtual const char*	TObject::ClassName() const
virtual void	Clear(Option_t* option = "")
virtual TObject*	TNamed::Clone(const char* newname = "") const
virtual Int_t	TNamed::Compare(const TObject* obj) const
virtual void	TNamed::Copy(TObject& named) const
virtual void	TObject::Delete(Option_t* option = "")	// *MENU*
virtual Int_t	TObject::DistancetoPrimitive(Int_t px, Int_t py)
virtual void	TObject::Draw(Option_t* option = "")
virtual void	TObject::DrawClass() const	// *MENU*
virtual TObject*	TObject::DrawClone(Option_t* option = "") const	// *MENU*
virtual void	TObject::Dump() const	// *MENU*
virtual void	TObject::Error(const char* method, const char* msgfmt) const
virtual void	TObject::Execute(const char* method, const char* params, Int_t* error = 0)
virtual void	TObject::Execute(TMethod* method, TObjArray* params, Int_t* error = 0)
virtual void	TObject::ExecuteEvent(Int_t event, Int_t px, Int_t py)
virtual void	TObject::Fatal(const char* method, const char* msgfmt) const
virtual void	TNamed::FillBuffer(char*& buffer)
virtual TObject*	TObject::FindObject(const char* name) const
virtual TObject*	TObject::FindObject(const TObject* obj) const
const TMatrixD*	GetCovarianceMatrix() const
virtual Option_t*	TObject::GetDrawOption() const
static Long_t	TObject::GetDtorOnly()
const TVectorD*	GetEigenValues() const
const TMatrixD*	GetEigenVectors() const
TList*	GetHistograms() const
virtual const char*	TObject::GetIconName() const
const TVectorD*	GetMeanValues() const
virtual const char*	TNamed::GetName() const
virtual char*	TObject::GetObjectInfo(Int_t px, Int_t py) const
static Bool_t	TObject::GetObjectStat()
virtual Option_t*	TObject::GetOption() const
const Double_t*	GetRow(Int_t row)
const TVectorD*	GetSigmas() const
virtual const char*	TNamed::GetTitle() const
virtual UInt_t	TObject::GetUniqueID() const
const TVectorD*	GetUserData() const
virtual Bool_t	TObject::HandleTimer(TTimer* timer)
virtual ULong_t	TNamed::Hash() const
virtual void	TObject::Info(const char* method, const char* msgfmt) const
virtual Bool_t	TObject::InheritsFrom(const char* classname) const
virtual Bool_t	TObject::InheritsFrom(const TClass* cl) const
virtual void	TObject::Inspect() const	// *MENU*
void	TObject::InvertBit(UInt_t f)
virtual TClass*	IsA() const
virtual Bool_t	TObject::IsEqual(const TObject* obj) const
virtual Bool_t	IsFolder() const
Bool_t	TObject::IsOnHeap() const
virtual Bool_t	TNamed::IsSortable() const
Bool_t	TObject::IsZombie() const
virtual void	TNamed::ls(Option_t* option = "") const
virtual void	MakeCode(const char* filename = "pca", Option_t* option = "")	// *MENU*
virtual void	MakeHistograms(const char* name = "pca", Option_t* option = "epsdx")	// *MENU*
virtual void	MakeMethods(const char* classname = "PCA", Option_t* option = "")	// *MENU*
virtual void	MakePrincipals()	// *MENU*
void	TObject::MayNotUse(const char* method) const
virtual Bool_t	TObject::Notify()
static void	TObject::operator delete(void* ptr)
static void	TObject::operator delete(void* ptr, void* vp)
static void	TObject::operator delete[](void* ptr)
static void	TObject::operator delete[](void* ptr, void* vp)
void*	TObject::operator new(size_t sz)
void*	TObject::operator new(size_t sz, void* vp)
void*	TObject::operator new[](size_t sz)
void*	TObject::operator new[](size_t sz, void* vp)
virtual void	P2X(const Double_t* p, Double_t* x, Int_t nTest)
virtual void	TObject::Paint(Option_t* option = "")
virtual void	TObject::Pop()
virtual void	Print(Option_t* opt = "MSE") const	// *MENU*
virtual Int_t	TObject::Read(const char* name)
virtual void	TObject::RecursiveRemove(TObject* obj)
void	TObject::ResetBit(UInt_t f)
virtual void	TObject::SaveAs(const char* filename = "", Option_t* option = "") const	// *MENU*
virtual void	TObject::SavePrimitive(basic_ostream<char,char_traits<char> >& out, Option_t* option = "")
void	TObject::SetBit(UInt_t f)
void	TObject::SetBit(UInt_t f, Bool_t set)
virtual void	TObject::SetDrawOption(Option_t* option = "")	// *MENU*
static void	TObject::SetDtorOnly(void* obj)
virtual void	TNamed::SetName(const char* name)	// *MENU*
virtual void	TNamed::SetNameTitle(const char* name, const char* title)
static void	TObject::SetObjectStat(Bool_t stat)
virtual void	TNamed::SetTitle(const char* title = "")	// *MENU*
virtual void	TObject::SetUniqueID(UInt_t uid)
virtual void	ShowMembers(TMemberInspector& insp, char* parent)
virtual Int_t	TNamed::Sizeof() const
virtual void	Streamer(TBuffer& b)
void	StreamerNVirtual(TBuffer& b)
virtual void	SumOfSquareResiduals(const Double_t* x, Double_t* s)
virtual void	TObject::SysError(const char* method, const char* msgfmt) const
void	Test(Option_t* option = "")	// *MENU*
Bool_t	TObject::TestBit(UInt_t f) const
Int_t	TObject::TestBits(UInt_t f) const
virtual void	TObject::UseCurrentStyle()
virtual void	TObject::Warning(const char* method, const char* msgfmt) const
virtual Int_t	TObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0)
virtual Int_t	TObject::Write(const char* name = 0, Int_t option = 0, Int_t bufsize = 0) const
virtual void	X2P(const Double_t* x, Double_t* p)

protected:

	TPrincipal(const TPrincipal&)
virtual void	TObject::DoError(int level, const char* location, const char* fmt, va_list va) const
void	MakeNormalised()
void	MakeRealCode(const char* filename, const char* prefix, Option_t* option = "")
void	TObject::MakeZombie()
TPrincipal&	operator=(const TPrincipal&)

Data Members

TMatrixD	fCovarianceMatrix	Covariance matrix
TVectorD	fEigenValues	Eigenvalue vector of trans
TMatrixD	fEigenVectors	Eigenvector matrix of trans
TList*	fHistograms	List of histograms
Bool_t	fIsNormalised	Normalize matrix?
TVectorD	fMeanValues	Mean value over all data points
TString	TNamed::fName	object identifier
Int_t	fNumberOfDataPoints	Number of data points
Int_t	fNumberOfVariables	Number of variables
TVectorD	fOffDiagonal	elements of the tridiagonal
TVectorD	fSigmas	vector of sigmas
Bool_t	fStoreData	Should we store input data?
TString	TNamed::fTitle	object title
Double_t	fTrace	Trace of covariance matrix
TVectorD	fUserData	Vector of original data points

Function documentation

void AddRow(const Double_t* x)

Add a data point and update the covariance matrix. The input array must be fNumberOfVariables long.

The covariance matrix and mean values of the input data are calculated on the fly by the following equations:

$\begin{displaymath} \left<x_i\right>^{(0)} = x_{i0} \end{displaymath}$

$\begin{displaymath} \left<x_i\right>^{(n)} = \left<x_i\right>^{(n-1)} + \frac1n \left(x_{in} - \left<x_i\right>^{(n-1)}\right) \end{displaymath}$

$\begin{displaymath} C_{ij}^{(0)} = 0 \end{displaymath}$

$\begin{displaymath} C_{ij}^{(n)} = C_{ij}^{(n-1)} + \frac1{n-1}\left[\left(x_{in} - \left<x_i\right>^{(n)}\right) \left(x_{jn} - \left<x_j\right>^{(n)}\right)\right] - \frac1n C_{ij}^{(n-1)} \end{displaymath}$

This is a fast method that keeps rounding errors small (please refer to CERN 72-21, pp. 54-106).

The data is stored internally in a TVectorD, in the following way:

$\begin{displaymath} \mathbf{x} = \left[\left(x_{0_0},\ldots,x_{{P-1}_0}\right),\ldots, \left(x_{0_i},\ldots,x_{{P-1}_i}\right), \ldots\right] \end{displaymath}$

with $P$ as defined in the class description.
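The recurrences above can be sketched in plain C++ as follows. This is a minimal illustration, not the source of TPrincipal::AddRow: the class name `RunningCovariance` and its accessors are hypothetical, and the sketch assumes that n counts the data points including the one being added, so that the starting values apply to the first point and the recurrences to every later one. Under that assumption the result is the covariance with 1/n normalisation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper illustrating the on-the-fly update of mean and covariance.
class RunningCovariance {
public:
    explicit RunningCovariance(std::size_t nVariables)
        : fMean(nVariables, 0.0),
          fCov(nVariables, std::vector<double>(nVariables, 0.0)) {}

    // Add one data point; x must hold nVariables values.
    void addRow(const std::vector<double>& x) {
        assert(x.size() == fMean.size());
        ++fN;
        if (fN == 1) {            // first point: mean = x, covariance stays 0
            fMean = x;
            return;
        }
        const double n = static_cast<double>(fN);
        // Mean update: <x>^(n) = <x>^(n-1) + (x_n - <x>^(n-1)) / n
        for (std::size_t i = 0; i < fMean.size(); ++i)
            fMean[i] += (x[i] - fMean[i]) / n;
        // Covariance update, using the already updated means <x>^(n)
        for (std::size_t i = 0; i < fMean.size(); ++i)
            for (std::size_t j = 0; j < fMean.size(); ++j)
                fCov[i][j] += (x[i] - fMean[i]) * (x[j] - fMean[j]) / (n - 1.0)
                              - fCov[i][j] / n;
    }

    const std::vector<double>& mean() const { return fMean; }
    const std::vector<std::vector<double>>& covariance() const { return fCov; }
    std::size_t size() const { return fN; }

private:
    std::vector<double> fMean;                 // running mean values
    std::vector<std::vector<double>> fCov;     // running covariance matrix
    std::size_t fN = 0;                        // number of points added so far
};
```

Adding the points (0, 0), (2, 4) and (4, 2) one at a time gives the mean (2, 2) and a covariance matrix with 8/3 on the diagonal and 4/3 off the diagonal, the same values a two-pass computation over the stored sample would give.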

enum TObject::EStatusBits {
	kCanDelete
	kMustCleanup
	kObjInCanvas
	kIsReferenced
	kHasUUID
	kCannotPick
	kNoContextMenu
	kInvalidObject
};
enum TObject::[unnamed] {
	kIsOnHeap
	kNotDeleted
	kZombie
	kBitMask
	kSingleKey
	kOverwrite
	kWriteDelete
};