Easily overlooked pitfalls of np.cov()

Latest recommended article published 2021-11-30 10:50:27

anthea_luo

Category: Machine Learning

Article link: https://blog.csdn.net/anthea_luo/article/details/94590170

Five samples, each with two features:
import numpy as np
a = np.array([(2.5, 2.3), (1.5, 1.3), (2.2, 2.9), (2.1, 2.7), (1.7, 1.9)])

Calling np.cov(a) returns an array of shape (5, 5):
array([[ 0.02 , 0.02 , -0.07 , -0.06 , -0.02 ],
       [ 0.02 , 0.02 , -0.07 , -0.06 , -0.02 ],
       [-0.07 , -0.07 , 0.245, 0.21 , 0.07 ],
       [-0.06 , -0.06 , 0.21 , 0.18 , 0.06 ],
       [-0.02 , -0.02 , 0.07 , 0.06 , 0.02 ]])

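This looked strange: with only two feature dimensions, why is the result a 5*5 matrix?

I searched for covariance and found several examples online, and they mostly matched my understanding. So why was the actual result completely different?
Covariance measures whether feature dimensions vary in the same direction or in opposite directions, and there are only two features here. So where is the problem?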

It turns out that transposing the input gives the expected result (np.cov(a, rowvar=False) works too):
np.cov(a.T)
array([[ 0.16 , 0.195],
       [ 0.195, 0.412]])
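
As a quick check of the claim above, the following minimal sketch compares the three calls on the same data. The variable names (cov_default, cov_transposed, cov_rowvar_false) are only illustrative and are not from the original post.

```python
import numpy as np

# Five samples (rows), two features (columns), same data as above.
a = np.array([(2.5, 2.3), (1.5, 1.3), (2.2, 2.9), (2.1, 2.7), (1.7, 1.9)])

cov_default = np.cov(a)                     # rows are treated as variables
cov_transposed = np.cov(a.T)                # columns become the variables
cov_rowvar_false = np.cov(a, rowvar=False)  # same effect without transposing

print(cov_default.shape)       # (5, 5)
print(cov_transposed.shape)    # (2, 2)
print(np.allclose(cov_transposed, cov_rowvar_false))  # True
```

Passing rowvar=False is probably the more readable choice, since it states the data layout explicitly instead of hiding it in a transpose.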

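Take a look at the docstring in the source code:
Its use of "variable" and "observation" differs from the way the data is usually described. It was confusing at first, and it took me many hours to sort out. (There were two main causes. The first mistake is probably one that few people make; the second pitfall is one that many people may overlook: the layout is transposed relative to the usual orientation.)

Normally, in the data we use for training, each row is a sample and each column is a feature.

Here, a variable is a feature and an observation is a sample. Before I accepted this terminology, I thought it could also be read the other way around: the sample is the quantity that varies, and an observation is the set of results gathered for one feature, that is, one column (a single observation of all those variables)?
The big pitfall is this: with the default value of rowvar, each row is treated as a feature and each column as a sample. That is the opposite of the usual data layout. It also explains why np.cov() on five samples with two features each returns an array of shape (5, 5).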
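
To make the row-as-variable reading concrete, here is a small sketch (illustrative names, not from the original post) showing that each entry of the (5, 5) result is computed from one or two rows, each treated as a variable with only two observations:

```python
import numpy as np

a = np.array([(2.5, 2.3), (1.5, 1.3), (2.2, 2.9), (2.1, 2.7), (1.7, 1.9)])
cov_default = np.cov(a)

# Diagonal entry [0, 0]: sample variance of row 0, i.e. of the values (2.5, 2.3).
row0_variance = np.var(a[0], ddof=1)

# Off-diagonal entry [0, 2]: sample covariance between row 0 and row 2.
d0 = a[0] - a[0].mean()
d2 = a[2] - a[2].mean()
row0_row2_cov = np.sum(d0 * d2) / (a.shape[1] - 1)

print(np.isclose(cov_default[0, 0], row0_variance))  # True
print(np.isclose(cov_default[0, 2], row0_row2_cov))  # True
```

In other words, the default call answers the question "how do the five samples co-vary across the two features", which is rarely what is wanted for this kind of data.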

    Parameters
    ----------
    m : array_like
        A 1-D or 2-D array containing multiple variables and observations.
        Each row of `m` represents a variable, and each column a single
        observation of all those variables. Also see `rowvar` below.
    y : array_like, optional
        An additional set of variables and observations. `y` has the same form
        as that of `m`.
    rowvar : bool, optional
        If `rowvar` is True (default), then each row represents a
        variable, with observations in the columns. Otherwise, the relationship
        is transposed: each column represents a variable, while the rows
        contain observations.
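
For comparison, the covariance matrix for the usual layout (rows are samples, columns are features) can be computed by hand. The sketch below assumes the default normalization of np.cov, which divides by N - 1; the names centered and manual_cov are made up for this example.

```python
import numpy as np

a = np.array([(2.5, 2.3), (1.5, 1.3), (2.2, 2.9), (2.1, 2.7), (1.7, 1.9)])

n_samples = a.shape[0]

# Subtract each feature's mean (column-wise), since columns are the features.
centered = a - a.mean(axis=0)

# Sample covariance: (X_c^T X_c) / (N - 1), giving a (features x features) matrix.
manual_cov = centered.T @ centered / (n_samples - 1)

print(manual_cov.shape)                                   # (2, 2)
print(np.allclose(manual_cov, np.cov(a, rowvar=False)))   # True
```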

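Many of the covariance examples found online are computed with MATLAB. MATLAB uses the same data layout we normally do (rows are samples, columns are features), so those examples read naturally. Why is the np.cov default the other way around? If anyone knows, pointers are welcome.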