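PCA (Principal Component Analysis) is mainly used for dimensionality reduction.
Consider the multidimensional vector formed by the features of each sample in a set. Some elements of that vector carry no discriminative power: if an element equals 1 in every sample, or stays very close to 1, then using it as a feature contributes almost nothing to telling samples apart. Our goal is therefore to find the elements that vary a lot, that is, the dimensions with large variance, and to discard the dimensions that barely vary. What remains in the feature is the part that best represents the sample, and the amount of computation drops as well.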
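To make "dimensions with large variance" concrete, here is a minimal sketch that computes the sample variance of every feature dimension. The function name `columnVariances` and the nested-vector layout (one sample per row) are assumptions made for this illustration, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample variance of each column of a sample matrix
// (one sample per row, one feature dimension per column).
// Hypothetical helper written for this illustration.
std::vector<double> columnVariances(const std::vector<std::vector<double>>& samples)
{
    assert(samples.size() >= 2);  // the n - 1 divisor needs at least two samples
    const std::size_t n = samples.size();
    const std::size_t dims = samples[0].size();

    // Per-dimension mean.
    std::vector<double> mean(dims, 0.0);
    for (const auto& row : samples)
        for (std::size_t d = 0; d < dims; ++d)
            mean[d] += row[d];
    for (double& m : mean) m /= n;

    // Per-dimension sum of squared deviations, divided by n - 1.
    std::vector<double> var(dims, 0.0);
    for (const auto& row : samples)
        for (std::size_t d = 0; d < dims; ++d)
            var[d] += (row[d] - mean[d]) * (row[d] - mean[d]);
    for (double& v : var) v /= (n - 1);
    return var;
}
```

A dimension that is 1 in every sample gets variance 0 here, which marks it as a candidate for removal.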
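For a k-dimensional feature, each dimension is orthogonal to the others (in a multidimensional coordinate system, the axes are mutually perpendicular). We can change the coordinate system so that the feature has large variance along some dimensions and very small variance along others. For example, take an ellipse tilted at 45 degrees. In the original coordinate system, the x and y coordinates of its points are poor at distinguishing them, because the variance along the x axis and the variance along the y axis are about the same, and we cannot tell which point we are looking at from its x coordinate alone. If we rotate the axes so that the long axis of the ellipse becomes the x axis, the points spread widely along the long axis (large variance) and narrowly along the short axis (small variance). We can then keep only the long-axis coordinate to distinguish the points on the ellipse, and this discriminates better than the original x and y axes.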
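The ellipse example can be checked numerically. The sketch below generates points on an ellipse tilted by 45 degrees and measures the variance of their projections onto a chosen direction. The names `Point`, `tiltedEllipse` and `varianceAlong`, and the semi-axis lengths used when calling them, are assumptions chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Points on an ellipse with semi-axes a (long) and b (short), tilted by 45 degrees.
std::vector<Point> tiltedEllipse(double a, double b, int count)
{
    assert(count >= 2);
    const double pi = std::acos(-1.0);
    const double r = std::sqrt(0.5);  // cos(45 deg) == sin(45 deg)
    std::vector<Point> pts;
    for (int i = 0; i < count; ++i) {
        const double t = 2.0 * pi * i / count;
        const double u = a * std::cos(t);  // coordinate along the long axis
        const double v = b * std::sin(t);  // coordinate along the short axis
        pts.push_back({r * (u - v), r * (u + v)});  // rotate by +45 degrees
    }
    return pts;
}

// Sample variance of the projections of the points onto the unit direction (dx, dy).
double varianceAlong(const std::vector<Point>& pts, double dx, double dy)
{
    double mean = 0.0;
    for (const Point& p : pts) mean += p.x * dx + p.y * dy;
    mean /= pts.size();
    double var = 0.0;
    for (const Point& p : pts) {
        const double d = p.x * dx + p.y * dy - mean;
        var += d * d;
    }
    return var / (pts.size() - 1);
}
```

With `tiltedEllipse(3.0, 1.0, 360)`, the variances along (1, 0) and (0, 1) come out essentially equal, while the variance along the long-axis direction (r, r) with r = sqrt(0.5) is much larger than along the short-axis direction (-r, r).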
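So the approach is to compute a projection matrix for the k-dimensional feature, one that maps the feature from the high-dimensional space to a lower-dimensional one. The projection matrix is also called the transformation matrix. The dimensions of the new low-dimensional feature must be mutually orthogonal, and they are, because the eigenvectors used are orthogonal. We compute the covariance matrix of the sample matrix and then the eigenvectors of that covariance matrix; these eigenvectors form the projection matrix. Which eigenvectors to keep is decided by the size of the corresponding eigenvalues of the covariance matrix.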
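For a two-dimensional feature the eigenvector with the largest eigenvalue can be written in closed form, which makes the idea easy to try. The sketch below handles only the symmetric 2 x 2 case; the struct `EigenPair` and the function `principalEigenpair` are hypothetical names for this illustration, and larger matrices need an iterative method instead.

```cpp
#include <cassert>
#include <cmath>

struct EigenPair { double value; double vx, vy; };

// Largest eigenvalue and its unit eigenvector for the symmetric
// 2 x 2 matrix [[a, b], [b, c]] (for example a 2 x 2 covariance matrix).
EigenPair principalEigenpair(double a, double b, double c)
{
    const double half = 0.5 * (a - c);
    // Larger root of the characteristic polynomial.
    const double lambda = 0.5 * (a + c) + std::sqrt(half * half + b * b);
    double vx, vy;
    if (b != 0.0) { vx = b; vy = lambda - a; }   // solves (a - lambda) * vx + b * vy = 0
    else if (a >= c) { vx = 1.0; vy = 0.0; }     // already diagonal: pick the larger axis
    else { vx = 0.0; vy = 1.0; }
    const double norm = std::sqrt(vx * vx + vy * vy);
    assert(norm > 0.0);
    return {lambda, vx / norm, vy / norm};
}
```

For the matrix [[1, 2], [2, 4]] this gives the eigenvalue 5 and a direction proportional to (1, 2).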
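An example:
Take a training set of 100 samples whose features are 10-dimensional. It forms a 100*10 sample matrix. The covariance matrix of these samples is 10*10, and computing its eigenvalues and eigenvectors gives 10 of each. Ranking by eigenvalue, we take the eigenvectors that belong to the four largest eigenvalues and form a 10*4 matrix; this is the feature matrix we are after. Multiplying the 100*10 sample matrix by the 10*4 feature matrix yields a new 100*4 sample matrix, so the dimensionality of every sample has been reduced.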
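The covariance step of this example can be sketched as follows. The alias `Matrix` and the function `covarianceMatrix` are names assumed for this illustration, and the n - 1 divisor (sample covariance) is also an assumption, since the text does not say which normalization it uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[row][col]

// Covariance matrix of a sample matrix with one sample per row.
// An n x d input (e.g. 100 x 10) yields a d x d result (e.g. 10 x 10).
Matrix covarianceMatrix(const Matrix& samples)
{
    assert(samples.size() >= 2);
    const std::size_t n = samples.size();
    const std::size_t d = samples[0].size();

    // Mean of every feature dimension.
    std::vector<double> mean(d, 0.0);
    for (const auto& row : samples)
        for (std::size_t j = 0; j < d; ++j) mean[j] += row[j];
    for (double& m : mean) m /= n;

    // Accumulate products of deviations, then divide by n - 1.
    Matrix cov(d, std::vector<double>(d, 0.0));
    for (const auto& row : samples)
        for (std::size_t a = 0; a < d; ++a)
            for (std::size_t b = 0; b < d; ++b)
                cov[a][b] += (row[a] - mean[a]) * (row[b] - mean[b]);
    for (auto& r : cov)
        for (double& c : r) c /= (n - 1);
    return cov;
}
```

The result is symmetric, which is the property the eigenvector computation relies on.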
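Given a test feature, say a 1*10 feature, multiply it by the 10*4 feature matrix obtained above to get a 1*4 feature, and use that feature for classification.
So doing PCA really comes down to computing this projection matrix: multiplying a high-dimensional feature by the projection matrix reduces it to the specified number of dimensions.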
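A minimal sketch of projecting a single feature is shown below. Note one assumption that goes beyond the text: the training mean is subtracted before multiplying, which is the usual convention (pass a zero mean to get the plain product described above). `Matrix` and `projectFeature` are names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[row][col]

// Project one d-dimensional feature x onto k components: y = (x - mean) * W.
// W is d x k with one eigenvector per column (e.g. 10 x 4), so y has k entries.
std::vector<double> projectFeature(const std::vector<double>& x,
                                   const std::vector<double>& mean,
                                   const Matrix& W)
{
    assert(x.size() == mean.size() && x.size() == W.size());
    const std::size_t d = x.size();
    const std::size_t k = d == 0 ? 0 : W[0].size();
    std::vector<double> y(k, 0.0);
    for (std::size_t j = 0; j < d; ++j)
        for (std::size_t c = 0; c < k; ++c)
            y[c] += (x[j] - mean[j]) * W[j][c];
    return y;
}
```

Applying the same function to every row of the 100*10 training matrix would produce the 100*4 reduced matrix.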
OpenCV has a dedicated function for obtaining this projection matrix (feature matrix):
void cvCalcPCA( const CvArr* data, CvArr* avg, CvArr* eigenvalues, CvArr* eigenvectors, int flags );
A simple example that uses cvCalcPCA to compute the principal components:
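CvMat* pData = cvCreateMat(100, 2, CV_32FC1); // 100 two-dimensional data points, one per row
for (int i = 0; i < 100; i++)
{
    cvSet2D(pData, i, 0, cvRealScalar(i));
    cvSet2D(pData, i, 1, cvRealScalar(i));
}
CvMat* pMean = cvCreateMat(1, 2, CV_32FC1);    // mean of each dimension
CvMat* pEigVals = cvCreateMat(1, 2, CV_32FC1); // eigenvalues
CvMat* pEigVecs = cvCreateMat(2, 2, CV_32FC1); // eigenvectors, one per row
cvCalcPCA(pData, pMean, pEigVals, pEigVecs, CV_PCA_DATA_AS_ROW);
// Copy the results out for inspection. memcpy counts bytes, so each copy is sized
// to its destination buffer (2, 4 and 2 floats) instead of a fixed 100 bytes,
// which read past the end of these small matrices. memcpy needs <string.h>.
float eigVals[2], eigVecs[4], mean[2];
memcpy(eigVals, pEigVals->data.fl, sizeof(eigVals));
memcpy(eigVecs, pEigVecs->data.fl, sizeof(eigVecs));
memcpy(mean, pMean->data.fl, sizeof(mean));
cvReleaseMat(&pData);
cvReleaseMat(&pMean);
cvReleaseMat(&pEigVals);
cvReleaseMat(&pEigVecs);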
Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.
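A major benefit of PCA is dimensionality reduction. We can rank the newly computed principal component vectors by importance, keep as many of the leading ones as needed, and drop the remaining dimensions. This reduces the dimensionality, which simplifies the model or compresses the data, while preserving as much of the information in the original data as possible.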
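One common way to decide how many leading components to keep is to look at the fraction of total variance they explain. The text does not prescribe this rule, so treat the sketch below as one possible heuristic; the function name `componentsToKeep` and the `target` parameter are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Smallest k such that the k largest eigenvalues account for at least
// `target` (a fraction in (0, 1]) of the sum of all eigenvalues.
std::size_t componentsToKeep(std::vector<double> eigenvalues, double target)
{
    // Rank by importance: largest eigenvalue first.
    std::sort(eigenvalues.begin(), eigenvalues.end(), std::greater<double>());
    const double total = std::accumulate(eigenvalues.begin(), eigenvalues.end(), 0.0);
    assert(total > 0.0);

    double running = 0.0;
    for (std::size_t k = 0; k < eigenvalues.size(); ++k) {
        running += eigenvalues[k];
        if (running / total >= target) return k + 1;
    }
    return eigenvalues.size();
}
```

With eigenvalues 6, 3 and 1, a target of 0.8 keeps two components, because the first alone explains 60% and the first two explain 90%.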
PCA is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data. The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. This technique is used in image compression.
method:
Step 1: Get some data.
Step 2: Subtract the mean. The mean subtracted is the average across each dimension.
Step 3: Calculate the covariance matrix.
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 5: Choosing components and forming a feature vector.
Here is where the notion of data compression and reduced dimensionality comes into it. In fact, it turns out that the eigenvector with the highest eigenvalue is the principal component of the data set.
In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance.
What needs to be done now is you need to form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns.
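Step 5 can be sketched as follows: order the eigenvectors by eigenvalue and place the first k of them in the columns of a matrix. The names `Matrix` and `featureVector`, and the convention that `eigenvectors[i]` belongs to `eigenvalues[i]`, are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[row][col]

// Build the feature vector matrix: the k eigenvectors with the largest
// eigenvalues, stored as columns. eigenvectors[i] pairs with eigenvalues[i].
Matrix featureVector(const std::vector<double>& eigenvalues,
                     const Matrix& eigenvectors, std::size_t k)
{
    assert(eigenvalues.size() == eigenvectors.size());
    assert(k <= eigenvalues.size());

    // Indices of the eigenpairs, ordered from highest to lowest eigenvalue.
    std::vector<std::size_t> order(eigenvalues.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return eigenvalues[a] > eigenvalues[b]; });

    const std::size_t d = eigenvectors.empty() ? 0 : eigenvectors[0].size();
    Matrix F(d, std::vector<double>(k, 0.0));
    for (std::size_t c = 0; c < k; ++c)
        for (std::size_t r = 0; r < d; ++r)
            F[r][c] = eigenvectors[order[c]][r];  // column c = c-th most significant
    return F;
}
```

Choosing k smaller than the number of eigenvectors is what discards the components of lesser significance.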
Step 6: Deriving the new data set
Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed.
It will give us the original data solely in terms of the vectors we chose.
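Written as a single matrix product, this last step looks as follows. The symbols are shorthand introduced here, not notation from the tutorial: F is the d x k feature vector matrix (eigenvectors in columns), X_adj is the n x d mean-adjusted data with one sample per row, and n, d, k are the numbers of samples, original dimensions and kept components.

```latex
\[
\underbrace{\text{FinalData}}_{k \times n}
  \;=\;
\underbrace{F^{\mathsf{T}}}_{k \times d}\,
\underbrace{X_{\text{adj}}^{\mathsf{T}}}_{d \times n}
\]
```

Each column of FinalData is one sample expressed in the k chosen eigenvector directions.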
Reference: Lindsay I Smith. A tutorial on Principal Components Analysis. February 26, 2002
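The mathematical explanation:
In principal component analysis, the covariance matrix Cx contains the correlation measures between all pairs of observed variables. More importantly, these measures reflect the degree of noise and redundancy in the data.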
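In general, the covariance matrix of the raw data is far from ideal: the signal-to-noise ratio is low and the variables are strongly correlated. The goal of PCA is to optimize the covariance matrix through a change of basis and to find the relevant principal components. So how is this optimization done, and which properties of the matrix matter?
The principles behind principal component analysis and the optimization of the covariance matrix are: 1) minimize the redundancy between variables, which means the off-diagonal elements of the covariance matrix should be as small as possible (ideally 0); 2) maximize the signal, which means the diagonal elements of the covariance matrix should be as large as possible. The larger a diagonal element of the optimized matrix Cy, the larger the signal component, in other words the more important the corresponding principal component.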
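The two principles above amount to diagonalizing the covariance matrix, which a short derivation makes explicit. The conventions here are assumptions for the sketch: X is the mean-centered data matrix with one variable per row and n samples in its columns, P is the change-of-basis matrix, Y = PX is the transformed data, and Cx = E D E^T is the eigendecomposition of the symmetric matrix Cx.

```latex
\begin{align*}
C_X &= \frac{1}{n-1}\, X X^{\mathsf{T}}, \qquad Y = P X \\
C_Y &= \frac{1}{n-1}\, Y Y^{\mathsf{T}}
     = \frac{1}{n-1}\, P X X^{\mathsf{T}} P^{\mathsf{T}}
     = P\, C_X\, P^{\mathsf{T}} \\
C_X &= E D E^{\mathsf{T}}, \quad P = E^{\mathsf{T}}
  \;\Longrightarrow\;
C_Y = E^{\mathsf{T}} E\, D\, E^{\mathsf{T}} E = D
\end{align*}
```

With the rows of P chosen as the eigenvectors of Cx, all off-diagonal elements of Cy vanish and the diagonal holds the eigenvalues, that is, the variance along each principal component.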
The assumptions (and limitations) of PCA include:
1.
2.
3.
4.
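A short derivation shows how PCA relates to the singular value decomposition (A = USV'), in which the columns of U and the columns of V are orthonormal. Taking A to be the mean-centered data matrix with one sample per row, it is the columns of V, not the diagonal matrix S, that form the eigenvector matrix E of the PCA eigendecomposition, and the squared singular values divided by n-1 are the eigenvalues. So PCA does not have to compute the eigendecomposition of the covariance matrix Cx explicitly: applying the singular value decomposition directly to the data matrix A also yields the eigenvector matrix, that is, the principal component vectors.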
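The derivation can be sketched in a few lines. It assumes, as a convention for this sketch, that A is the mean-centered n x d data matrix with one sample per row, and it uses s_i for the i-th singular value and lambda_i for the i-th eigenvalue of the covariance matrix.

```latex
\begin{align*}
A &= U S V^{\mathsf{T}}, \qquad U^{\mathsf{T}} U = I, \quad V^{\mathsf{T}} V = I \\
A^{\mathsf{T}} A &= V S^{\mathsf{T}} U^{\mathsf{T}} U S V^{\mathsf{T}}
                  = V \left( S^{\mathsf{T}} S \right) V^{\mathsf{T}} \\
C_X &= \frac{1}{n-1}\, A^{\mathsf{T}} A
     = V \left( \frac{S^{\mathsf{T}} S}{n-1} \right) V^{\mathsf{T}}
  \;\Longrightarrow\;
E = V, \qquad \lambda_i = \frac{s_i^{2}}{n-1}
\end{align*}
```

If the data matrix is stored with one sample per column instead, the roles of U and V swap.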
Reference
http://hi.baidu.com/yangbme/blog/item/b8e86c0f1aed612e6059f3da
======================================
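Here X is the original image, Y is the target image, and A is the eigenvector matrix. Because of this I suspected that my computation of the eigenvector matrix was wrong. I later found a method online for computing the eigenvector matrix and used it to carry out the principal component analysis. The implementation is below: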
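// Compute the eigenvalues and eigenvectors of a real symmetric matrix with the Jacobi rotation method.
// pdblCof:   symmetric lChannelCount x lChannelCount matrix; overwritten, with the eigenvalues left on its diagonal.
// pdblVects: row-major lChannelCount x lChannelCount buffer; column k receives the eigenvector for pdblCof[k][k].
// dblEps:    convergence threshold (> 0) on the largest off-diagonal magnitude.
// ljt:       maximum number of rotations.
// Returns 1 on convergence, 0 if the rotation limit is reached first. Requires <cmath> and <vector>.
static int iJcobiMatrixCharacterValue(double** pdblCof, long lChannelCount, std::vector<double>& pdblVects, double dblEps, long ljt)
{
    long i, j, p = 0, q = 0, l;
    double fm, cn, sn, omega, x, y, d;
    l = 1;
    // Initialize the eigenvector matrix to the identity.
    for (i = 0; i < lChannelCount; i++)
    {
        pdblVects[i * lChannelCount + i] = 1.0;
        for (j = 0; j < lChannelCount; j++)
            if (i != j) pdblVects[i * lChannelCount + j] = 0.0;
    }
    while (1)
    {
        // Locate the off-diagonal element with the largest magnitude.
        fm = 0.0;
        for (i = 0; i < lChannelCount; i++)
            for (j = 0; j < lChannelCount; j++)
            {
                d = fabs(pdblCof[i][j]);
                if ((i != j) && (d > fm))
                { fm = d; p = i; q = j; }
            }
        if (fm < dblEps) return 1;
        if (l > ljt) return 0;
        l += 1;
        // Sine (sn) and cosine (cn) of the rotation that zeroes pdblCof[p][q].
        x = -pdblCof[p][q];
        y = (pdblCof[q][q] - pdblCof[p][p]) / 2.0;
        omega = x / sqrt(x * x + y * y);
        if (y < 0) omega = -omega;
        sn = 1.0 + sqrt(1.0 - omega * omega);
        sn = omega / sqrt(2.0 * sn);
        cn = sqrt(1.0 - sn * sn);
        // Update the two diagonal entries and clear the pivot pair.
        fm = pdblCof[p][p];
        pdblCof[p][p] = fm * cn * cn + pdblCof[q][q] * sn * sn + pdblCof[p][q] * omega;
        pdblCof[q][q] = fm * sn * sn + pdblCof[q][q] * cn * cn - pdblCof[p][q] * omega;
        pdblCof[p][q] = 0.0;
        pdblCof[q][p] = 0.0;
        // Rotate rows p and q.
        for (j = 0; j < lChannelCount; j++)
            if ((j != p) && (j != q))
            {
                fm = pdblCof[p][j];
                pdblCof[p][j] = fm * cn + pdblCof[q][j] * sn;
                pdblCof[q][j] = -fm * sn + pdblCof[q][j] * cn;
            }
        // Rotate columns p and q.
        for (i = 0; i < lChannelCount; i++)
            if ((i != p) && (i != q))
            {
                fm = pdblCof[i][p];
                pdblCof[i][p] = fm * cn + pdblCof[i][q] * sn;
                pdblCof[i][q] = -fm * sn + pdblCof[i][q] * cn;
            }
        // Accumulate the rotation into the eigenvector matrix.
        for (i = 0; i < lChannelCount; i++)
        {
            fm = pdblVects[i * lChannelCount + p];
            pdblVects[i * lChannelCount + p] = fm * cn + pdblVects[i * lChannelCount + q] * sn;
            pdblVects[i * lChannelCount + q] = -fm * sn + pdblVects[i * lChannelCount + q] * cn;
        }
    }
    return 1;
}

// Sort the columns of the eigenvector matrix by eigenvalue, from largest to smallest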
static void SortEigenvector(double** pfMatrix, int nBandNum, std::vector<double> &pfVector)
{
    long p;
    double f;
    double T;
    int count = nBandNum;
    for (int i = 0; i < count; i++)
    {
        T = pfMatrix[i][i];
        p = i;
        for (int j = i; j < count; j++)
            if (T < pfMatrix[j][j])
            {
                T = pfMatrix[j][j];
                p = j;
            }
        if (p != i)
        {
            f = pfMatrix[p][p];
            pfMatrix[p][p] = pfMatrix[i][i];
            pfMatrix[i][i] = f;
            for (int j = 0; j < count; j++)
            {
                f = pfVector[j * count + p];
                pfVector[j * count + p] = pfVector[j * count + i];
                pfVector[j * count + i] = f;
            }
        }
    }
}
After these two steps, the resulting eigenvector matrix is the matrix A that is multiplied with the original image.
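How A is applied to the image is not shown, so the following is only a sketch under stated assumptions: each pixel is treated as a vector of band values, the eigenvectors sit in the columns of the flat row-major buffer (the layout the Jacobi routine fills in), and A is taken to be the transpose of that matrix. Mean subtraction is left out. The function name `transformPixel` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// vects: row-major bands x bands buffer whose column k is the k-th eigenvector.
// Returns y = A x with A = V^T, i.e. y[k] = sum over j of vects[j * bands + k] * pixel[j].
std::vector<double> transformPixel(const std::vector<double>& vects,
                                   const std::vector<double>& pixel)
{
    const std::size_t bands = pixel.size();
    assert(vects.size() == bands * bands);
    std::vector<double> y(bands, 0.0);
    for (std::size_t k = 0; k < bands; ++k)       // output component k
        for (std::size_t j = 0; j < bands; ++j)   // input band j
            y[k] += vects[j * bands + k] * pixel[j];
    return y;
}
```

After sorting, the first output component corresponds to the largest eigenvalue, so keeping only the first few entries of y gives the reduced representation of the pixel.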