3 Machine Learning: Principal Component Analysis

Principal component analysis (PCA) is a method from multivariate statistical analysis, used mainly to reduce the dimensionality of data and filter out noise.
The main dimensionality-reduction techniques are principal component analysis and factor analysis.

1 The PCA Model

1.1 Mathematical Model

Let the sample data be $X = (x_{ij})_{n \times p}$, $i = 1 \dots n$, $j = 1 \dots p$, $\forall x_{ij} \in \mathbb{R}$, where $n$ is the number of samples and $p$ is the number of observed indicators per sample. Each row of the data, i.e. one observation, can be viewed as a point in $p$-dimensional space, so the $n$ samples are $n$ points in $p$-dimensional space.

For convenience, we adopt the following notation:

1) $\vec{x}_{i.}$ denotes the $i$-th row of the data, i.e. one sample point;

2) $\vec{x}_{.i}$ denotes the $i$-th column of the data, i.e. the sample values of one observed indicator.

Consider applying an orthogonal transformation to the sample data, i.e. changing its coordinate system while keeping the distances between samples unchanged: $Y = XP$, $P^t P = I$.
We want $Y^t Y = \Lambda$, where $\Lambda$ is a diagonal matrix. In other words, after the transformation the columns of $Y$ are mutually orthogonal.

This gives us:

$\Lambda = Y^t Y = P^t X^t X P = P^t (X^t X) P$

Let $H = X^t X$. $H$ is symmetric, so it has the eigendecomposition $H = Q^t \Sigma Q$, where the rows of $Q$ are the eigenvectors of $H$ and $Q$ is an orthogonal matrix; the diagonal elements of $\Sigma$ are the eigenvalues of $H$.
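As an aside on the eigendecomposition: for intuition, the dominant eigenvalue/eigenvector pair of a small symmetric $H$ can be approximated by power iteration. A minimal sketch in plain Java (the 2x2 matrix is invented for illustration; real code would call a linear algebra library):

```java
public class PowerIter {
    public static void main(String[] args) {
        // A small symmetric stand-in for H = X^t X; its eigenvalues are 3 and 1,
        // with dominant eigenvector (1, 1) / sqrt(2).
        double[][] H = {{2.0, 1.0}, {1.0, 2.0}};
        double[] v = {1.0, 0.0};                 // arbitrary starting vector
        for (int it = 0; it < 100; it++) {
            // w = H v
            double[] w = {H[0][0] * v[0] + H[0][1] * v[1],
                          H[1][0] * v[0] + H[1][1] * v[1]};
            double norm = Math.sqrt(w[0] * w[0] + w[1] * w[1]);
            v[0] = w[0] / norm;                  // renormalize every step
            v[1] = w[1] / norm;
        }
        // Rayleigh quotient v^t H v recovers the dominant eigenvalue.
        double lambda = v[0] * (H[0][0] * v[0] + H[0][1] * v[1])
                      + v[1] * (H[1][0] * v[0] + H[1][1] * v[1]);
        System.out.printf("lambda = %.4f%n", lambda);   // prints "lambda = 3.0000"
        System.out.printf("v = (%.4f, %.4f)%n", v[0], v[1]);
    }
}
```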

Q has two equivalent properties:

  1. $Q^t = Q^{-1}$
  2. $Q^t Q = Q Q^t = I$, where $I$ is the identity matrix

Let $P = Q^t$. Then:

$\Lambda = Y^t Y = P^t H P = P^t Q^t \Sigma Q P = \Sigma$

From the discussion above, there exists an orthogonal matrix $Q$ (the eigenvector matrix of $X^t X$) such that the transformed sample $Y = X Q^t$ has the following properties:

  1. The transformed sample satisfies our earlier requirement (stating the obvious): $Y^t Y = \Sigma$
  2. The distance between any two samples is unchanged by the transformation:
    $(\vec{y}_{j.} - \vec{y}_{k.})(\vec{y}_{j.} - \vec{y}_{k.})^t = (\vec{x}_{j.} - \vec{x}_{k.}) Q^t Q (\vec{x}_{j.} - \vec{x}_{k.})^t = (\vec{x}_{j.} - \vec{x}_{k.})(\vec{x}_{j.} - \vec{x}_{k.})^t, \quad \forall j, k \in 1..n$
  3. The inner product of any two transformed indicators (again just property 1, spelled out in a bit more detail) is
    $r_{i,j} = \vec{y}_{.i}^t \, \vec{y}_{.j} = \lambda_{i,j}$,
    and from $Y^t Y = \Lambda$ we know:
    $\lambda_{i,j} = \begin{cases} \lambda_i, & i = j \\ 0, & i \neq j \end{cases}$
    where $\lambda_i$ is a diagonal element of $\Lambda$, i.e. an eigenvalue of $X^t X$. The transformed indicators are therefore mutually uncorrelated.
  4. If every indicator of $X$ has been standardized, the eigenvalues of $X^t X$ are the variances of the corresponding transformed indicators:
    $\sigma_i^2 = (\vec{y}_{.i} - \bar{y}_{.i})^t (\vec{y}_{.i} - \bar{y}_{.i}) = \vec{y}_{.i}^t \, \vec{y}_{.i} - n \bar{y}_{.i}^2 = \vec{y}_{.i}^t \, \vec{y}_{.i} = \lambda_i$
    (the mean $\bar{y}_{.i}$ is zero because the columns of $X$ are centered).
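Property 2 (distance preservation) is easy to check numerically. A minimal sketch in plain Java, using a made-up 2x2 rotation matrix as the orthogonal $Q$ and two made-up sample points:

```java
public class OrthoDemo {
    public static void main(String[] args) {
        double t = 0.7;                          // arbitrary rotation angle
        // A rotation matrix is orthogonal: Q^t Q = I.
        double[][] Q = {{Math.cos(t), -Math.sin(t)},
                        {Math.sin(t),  Math.cos(t)}};
        double[] x1 = {1.0, 2.0}, x2 = {4.0, -1.0};  // two sample points
        double[] y1 = mul(x1, Q), y2 = mul(x2, Q);   // y = x Q
        System.out.printf("before: %.6f%n", dist(x1, x2));
        System.out.printf("after : %.6f%n", dist(y1, y2)); // same distance
    }
    // Row vector times matrix: y[j] = sum_i x[i] * m[i][j]
    static double[] mul(double[] x, double[][] m) {
        double[] y = new double[2];
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 2; i++) y[j] += x[i] * m[i][j];
        return y;
    }
    // Euclidean distance between two points
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

Both lines print the same value, as the derivation promises.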

1.2 Extracting Principal Components

To extract principal components, we rotate the sample data, keep the columns with large variance, and discard the columns with small variance, treating them as observation error (noise). This retains as much of the information in the sample as possible while lowering its dimensionality, which is why PCA is commonly used for denoising and dimensionality reduction.

From the discussion above, we sort the eigenvalues $\sigma_i^2$, $i = 1..p$, of $X^t X$ in descending order and keep the first $k$; for each kept eigenvalue we also extract the corresponding eigenvector $v_i$.
A common rule is to choose the smallest $k$ such that $(\sum_{i=1}^{k} \sigma_i^2) / (\sum_{i=1}^{p} \sigma_i^2) > 0.8$.
In general, we can run PCA directly on the samples, i.e. eigendecompose $H$; alternatively, we can eigendecompose the sample correlation matrix, which is equivalent to standardizing the raw samples first.
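The 80% rule can be sketched in a few lines of Java. The eigenvalues below are assumed to be already sorted in descending order (they are the squares of the standard deviations prcomp reports for the standardized USArrests data in section 2):

```java
public class ChooseK {
    public static void main(String[] args) {
        // Eigenvalues of X^t X in descending order
        // (squares of prcomp's standard deviations for scale = TRUE).
        double[] lambda = {2.4802, 0.9898, 0.3566, 0.1734};
        double total = 0;
        for (double l : lambda) total += l;
        double cum = 0;
        int k = 0;
        // Keep adding components until they explain more than 80% of the variance.
        while (cum / total <= 0.8) {
            cum += lambda[k];
            k++;
        }
        System.out.println("k = " + k);   // prints "k = 2"
    }
}
```

For this data the first component explains about 62% of the variance and the first two about 87%, so $k = 2$ satisfies the rule.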

1.3 Computation

Model computation:

1: Compute $H = X^t X$
2: Eigendecompose $H = Q^t \Sigma Q$
3: Take the eigenvectors of the top $k$ eigenvalues (the first $k$ rows of $Q$) to form the matrix $V$
4: Output $V^t$

Using the model:

Input a sample $\vec{x} = (x_1, x_2, \dots, x_p)$;
output its principal components $\vec{y} = \vec{x} V^t$
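Applying the model is just one matrix-vector product, $\vec{y} = \vec{x} V^t$. A minimal sketch with a made-up $2 \times 3$ loading matrix $V$ ($k = 2$ components, $p = 3$ indicators; rows are eigenvectors):

```java
public class Project {
    public static void main(String[] args) {
        // V: k x p loading matrix whose rows are the top-k eigenvectors.
        // The values are hypothetical, chosen so each row has unit length.
        double[][] V = {{0.6, 0.8, 0.0},
                        {0.0, 0.0, 1.0}};
        double[] x = {1.0, 2.0, 3.0};            // one p-dimensional sample
        // y = x V^t, i.e. y[i] is the inner product of x with row i of V.
        double[] y = new double[V.length];
        for (int i = 0; i < V.length; i++)
            for (int j = 0; j < x.length; j++)
                y[i] += x[j] * V[i][j];
        System.out.printf("%.2f %.2f%n", y[0], y[1]); // the 2-dimensional score
    }
}
```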

2 Implementation in R

Doing PCA in R is straightforward. There are two options: perform the eigendecomposition yourself, or call the prcomp function directly.


Calling prcomp directly:

The data is the USArrests dataset bundled with R.

> head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

## With scale = TRUE, prcomp standardizes the raw data, which amounts to eigendecomposing the correlation matrix.
> prcomp(USArrests, scale = TRUE)
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

## With scale = FALSE, the raw data is left unscaled (prcomp still centers each column by default).
> prcomp(USArrests, scale = FALSE)
Standard deviations (1, .., p=4):
[1] 83.732400 14.212402  6.489426  2.482790

Rotation (n x k) = (4 x 4):
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

3 Spark Code

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Read a CSV file and run principal component analysis on it.
 *
 * @param spark
 */
private static void prcomp(SparkSession spark) {
    Dataset<Row> data = spark.read()
            .option("header", true)
            .csv("E:/drawsky/USArrests.dat");
    // Column 0 holds the state names; keep the four numeric columns.
    Dataset<Row> training = data
            .select(data.columns()[1],
                    data.columns()[2],
                    data.columns()[3],
                    data.columns()[4]);
    StructType schema = new StructType(
            new StructField[] {
                new StructField("features", new VectorUDT(), false, Metadata.empty())
            });
    JavaRDD<Row> data1 = training.toJavaRDD()
            .map(e -> RowFactory.create(Vectors.dense(
                    Double.parseDouble(e.getString(0)),
                    Double.parseDouble(e.getString(1)),
                    Double.parseDouble(e.getString(2)),
                    Double.parseDouble(e.getString(3)))));
    Dataset<Row> data2 = spark.createDataFrame(JavaRDD.toRDD(data1), schema);
    // Set the input column, the output column, and keep the top 3 components.
    PCA model = new PCA()
            .setInputCol("features")
            .setOutputCol("pca")
            .setK(3);
    // Fit the model.
    PCAModel pca = model.fit(data2);
    System.out.println("Principal components (loadings) matrix\n" + pca.pc());
    // Print the PCA results.
    System.out.println("PCA results: original data -> principal components");
    pca.transform(data2).foreach(e -> System.out.println(e.mkString("->")));
}

Output:

Principal components (loadings) matrix
-0.041704320628287106  0.04482165626967025   -0.07989065942080516  
-0.9952212814264966    0.05876002785722259   0.06756973508380448   
-0.04633574611971064   -0.9768574799098904   0.20054628735386223   
-0.07515550058554704   -0.20071806645033347  -0.9740805921824927  

PCA results: original data -> principal components
[13.2,236.0,58.0,21.2]->[-239.7034893363034,-46.45394440645653,5.873076887678398]
[10.0,263.0,48.0,44.5]->[-267.7287758112544,-39.919009103568335,-16.74843082630301]
[8.1,294.0,80.0,31.0]->[-298.9695419442079,-66.73235484694379,5.0655924039817]
[8.8,190.0,50.0,19.5]->[-193.24136105996698,-41.19804042323062,3.167954683154271]
[9.0,276.0,91.0,40.6]->[-282.32427878003455,-80.42202157466309,-3.367728945064947]
[7.9,204.0,78.0,38.7]->[-209.87731161396687,-71.62153583719555,-8.901218756189458]
[3.3,110.0,77.0,11.1]->[-114.01404372270525,-70.83448196067587,11.798801236151562]
[5.9,238.0,72.0,15.8]->[-241.63235110108388,-59.25574960141734,14.659101392357412]
[15.4,335.0,80.0,31.9]->[-340.1456959738078,-64.17664187383437,6.376077195681571]
[17.4,211.0,60.0,25.8]->[-215.4365022422127,-50.611712212045795,-0.23138540831584464]
[5.3,46.0,83.0,20.2]->[-51.36521988471279,-82.19315971515614,-0.34629879279104614]
[2.6,120.0,54.0,14.2]->[-123.10432340359229,-48.43276080956096,4.8982076036796105]
[10.4,249.0,83.0,24.0]->[-253.89342295172096,-70.79901226567591,9.261408815881683]
[7.2,113.0,65.0,21.0]->[-117.35036491979547,-60.748216516592095,-0.3600164411911919]
[2.2,56.0,57.0,11.3]->[-59.31453595070623,-54.55982130195478,4.032173401475257]
[6.0,115.0,66.0,18.0]->[-119.11163154225757,-61.059185728960145,2.9937798841827252]
[9.7,109.0,52.0,16.3]->[-112.51814504335188,-47.22868033620168,1.141055017579081]
[15.4,249.0,66.0,22.2]->[-253.1789569697733,-53.60703430624881,7.206013699690494]
[2.1,83.0,51.0,7.8]->[-85.64028138839112,-46.414124603401234,8.070549663195614]
[11.3,300.0,67.0,27.8]->[-304.2314611573474,-52.89492032826787,5.725316863721723]
[4.4,149.0,85.0,16.3]->[-153.63504302303224,-77.3521308371684,10.885292398538983]
[12.1,255.0,74.0,35.1]->[-260.3528523267702,-63.806508501283815,-3.086198054041283]
[2.7,72.0,66.0,14.9]->[-75.94651013102968,-63.111552386514596,3.371570287433517]
[16.1,259.0,44.0,17.1]->[-261.7576833408582,-30.473532171373535,8.38158028727971]
[9.0,178.0,70.0,28.2]->[-182.88761432446316,-63.17759320257908,-2.1224356746459847]
[6.0,109.0,53.0,16.4]->[-112.41769035320549,-48.39144975095437,1.5397886855716774]
[4.3,102.0,62.0,16.5]->[-105.80478130328788,-57.69075588744742,2.9101231879669243]
[12.2,252.0,81.0,46.0]->[-258.51490409377396,-73.0041357029064,-12.510550768546914]
[2.1,57.0,56.0,9.5]->[-60.1239711528962,-53.16739344020404,5.660530981075771]
[7.4,159.0,89.0,18.8]->[-164.0856005351248,-81.03929067555256,9.688301440073829]
[11.4,285.0,70.0,32.1]->[-289.76948825888974,-57.565498705965354,1.116874087199438]
[11.1,254.0,86.0,26.1]->[-259.195556172882,-73.8259173462764,8.099403648184488]
[13.0,337.0,45.0,16.1]->[-339.22684014371134,-26.80533654640571,15.074307547557313]
[0.8,45.0,44.0,7.3]->[-47.405729104236734,-41.76691242253186,4.689973871872299]
[7.3,120.0,75.0,21.4]->[-124.81450398327509,-70.18127618164361,1.7208132751189886]
[6.6,151.0,68.0,20.0]->[-155.20760275939892,-61.27208282505877,3.8312873418899436]
[4.9,159.0,67.0,29.3]->[-163.7510860750687,-61.768019955937646,-4.751836451075299]
[6.3,106.0,72.0,14.9]->[-110.21218373051067,-66.81349835625755,6.584612630491144]
[3.4,174.0,87.0,8.3]->[-177.96529822562144,-76.27592222522462,20.848163747222564]
[14.4,279.0,48.0,22.5]->[-282.1823943119608,-34.36583590835889,5.410939061601159]
[3.8,86.0,45.0,12.8]->[-88.79460560394818,-41.30409315696344,2.0637640623960216]
[13.2,188.0,59.0,26.9]->[-192.40758992728888,-51.39537620228001,-2.721983484430563]
[12.7,201.0,80.0,25.5]->[-206.19244739321337,-70.88690845334817,3.7715532648558927]
[3.2,120.0,80.0,22.9]->[-124.98792825017598,-75.5504094715742,1.589975527239858]
[2.2,48.0,32.0,11.2]->[-50.18685649624293,-30.588392720420266,-1.424633603823482]
[8.5,156.0,63.0,20.7]->[-160.0838774955365,-56.14933678582608,2.332755913112379]
[4.0,145.0,73.0,26.2]->[-149.82548667143533,-67.86991871004479,-1.4029835888809359]
[5.7,81.0,39.0,9.3]->[-83.35667867724176,-34.94907403730168,3.7801274825930182]
[2.6,53.0,66.0,10.8]->[-56.72499779946268,-63.40953100898243,6.089464814731533]
[6.8,161.0,60.0,15.6]->[-164.46678626625547,-51.97749888357203,7.172590867615893]

Comparing the two sets of results, Spark's PCA does not reproduce R's prcomp output exactly, and this is more than a precision issue. Eigenvector signs are arbitrary, so the loadings match R's unscaled rotation only up to sign; and prcomp centers each column by default, while Spark's transform multiplies the raw, uncentered rows by the loadings, so the projected scores are shifted as well.
