单变量线性判别分析分类方法

最新推荐文章于 2024-08-29 22:32:12 发布

kdaHugh

最新推荐文章于 2024-08-29 22:32:12 发布

阅读量1.6k

点赞数

分类专栏： Python 机器学习文章标签：机器学习统计学 python

本文链接：https://blog.csdn.net/kdazhe/article/details/104468389

版权

Python 同时被 2 个专栏收录

9 篇文章

订阅专栏

机器学习

7 篇文章

订阅专栏

单变量线性判别分析分类方法

单变量线性判别分析

单变量线性判别分析

用贝叶斯方法进行分类

线性判别分析（linear discriminant analysis, LDA）与贝叶斯分类方法的关系十分紧密。我们先来看怎么用贝叶斯方法进行分类。

假设我们有数据集 $\{(X_1, Y_1), (X_2, \, Y_2), \cdots, \}$ 。这里 $X$ 是自变量， $Y$ 是因变量，或称为测量量、要预测的变量。 $X_i$ 可以是一个多维的向量。我们想要把测量到的点（观察点） $Y_i$ 分为 $K$ 类中的一类。我们假设每一类的先验概率为 $\pi_k$ ，即如果我们没有其他信息，随机得抽取一个样本点，这个样本点属于第 $k$ 类的概率就是 $\pi_k$ 。我们用 $\displaystyle f_k (x) = \mathbb{P} (X = x \vert Y = k)$ 来表示当观察点属于第 $k$ 类时，自变量 $X$ 的概率密度分布（当 $X$ 的取值为离散的时候，我们就用概率 $\mathbb{P} (X = x \vert Y = k)$ 表示）。

于是，根据Bayes 公式，我们有
$\begin{aligned} \mathbb{P} (Y = k \vert X = x) &= \frac{\mathbb{P} (X = x, Y = k)}{\mathbb{P} (X = x)} \\ &= \frac{\mathbb{P} (X = x, Y = k)}{\sum_{\ell = 1}^K \mathbb{P}(Y = \ell) \mathbb{P}(X = x \vert Y = \ell)} \\ &= \frac{\pi_k f_k (x) }{\sum_{\ell = 1}^K \pi_{\ell} f_{\ell} (x)} \end{aligned}$

所以，根据 Bayes 公式，如果我们能够计算出 $\displaystyle \frac{\pi_k f_k (x) }{\sum_{\ell = 1}^K \pi_{\ell} f_{\ell} (x)}$ 的值，我们可以比较得出使得 $\displaystyle \frac{\pi_k f_k (x) }{\sum_{\ell = 1}^K \pi_{\ell} f_{\ell} (x)}$ 的值最大的那个类 $k$ ，从而将 $X = x$ 这个数据分到第 $k$ 类中。为了方便起见，我们用 $p_k (x)$ 来表示 $\mathbb{P} (Y = k \vert X = x)$ 。

如果我们有预先的知识能让我们决定先验分布 $\pi_k$ ，我们可利用预先有的知识来确定 $\pi_k$ 。否则，我们就用样本的频数来估计 $\pi_k$ 。即 $\displaystyle \pi_k = n_k / n$ 。这里 $n_k$ 是属于第 $k$ 类的样本点的数目。 $n$ 是总共的样本数。

而线性判别分析，linear discriminant analysis (LDA) 就提供了一个估计 $p_k (x)$ 的方法。下面我们就来介绍如何用 LDA 来进行贝叶斯公式的估计，从而对数据进行分类。

单变量线性判别分析分类

所谓单变量的线性判别分析，是指 $X$ 只有一个变量，即 $X$ 是一维的。我们假设 $Y$ 可以取三个不同的类别。为了估计 $p_k (x)$ ，我们须要对 $f_{k} (x)$ 有一个估计。这里线性判别分析做了一个假设，即假设 $f_{k} (x)$ 服从一个正态分布。回忆上文， $\displaystyle f_k (x)$ 表示当观察点属于第 $k$ 类时，自变量 $X$ 的概率密度分布。根据这个假设，我们有
$f_k (x) = \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_k)^2}{2 \sigma_k^2} \right)$

有了对 $f_k (x)$ 的估计，我们便可以计算 $p_k (x)$ 。根据上面的公式，我们有 $\displaystyle p_k (x) = \frac{\pi_k f_k (x) }{\sum_{\ell = 1}^K \pi_{\ell} f_{\ell} (x)}$ 。代入 $f_{k} (x)$ 的表达式，我们有
$p_k (x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_k)^2}{2 \sigma_k^2} \right)}{\sum_{\ell = 1}^K \pi_{\ell} \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_\ell)^2}{2 \sigma_\ell^2} \right)}$

如果我们再做一个假设，假设 $\displaystyle \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2$ 。即对于不同类别的数据，它们对应的 $X$ 的分布的方差是相同的。那么我们就有

$p_k (x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_k)^2}{2 \sigma^2} \right)}{\sum_{\ell = 1}^K \pi_{\ell} \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_\ell)^2}{2 \sigma^2} \right)}. \tag{1}$

对于给定的 $X = x$ ，我们找出使得 $p_k (x)$ 最大的那一 $k$ 类，就把 $x$ 划分到第 $k$ 类。

分析公式（1），可以发现分母 $\displaystyle \sum_{\ell = 1}^K \pi_{\ell} \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_\ell)^2}{2 \sigma^2} \right)$ 是与 $k$ 无关的。从而我们只需要找到使得分子 $\displaystyle \pi_k \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{(x - \mu_k)^2}{2 \sigma^2} \right)$ 最大的那个 $k$ 即可。对分子求对数，我们有
$\begin{aligned} \argmax_{k} \left( p_k(x) \right) &= \argmax_{k} \delta_k(x) \\ &= \argmax_k \left( \log(\pi_k) +\frac{x \cdot \mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} \right) \\ \tag{2} \end{aligned}$ 。

由于 $\delta_k(x)$ 关于 $x$ 是线性的，所以我们称之为线性判别分析。

具体地说，如果我们要比较第一类（ $k = 1$ ）和第二类（ $k = 2$ ），我们代入表达式（2），来看什么情况下 $p_1(x)$ 比 $p_2(x)$ 大。这里先假设 $\mu_1 < \mu_2$ ，并且我们假设所有类别的先验概率都是一样的，即 $\pi_1 = \pi_2 = \cdots = \pi_K$ 。
$\begin{aligned} p_1(x) > p_2(x) &\Longleftrightarrow \frac{x \cdot \mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} > \frac{x \cdot \mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} \\ &\Longleftrightarrow x \cdot (\mu_1 - \mu_2) > \frac{1}{2} (\mu_1^2 - \mu_2^2) \\ &\Longleftrightarrow x < \frac{1}{2} (\mu_1 + \mu_2) \end{aligned}$

也就是说，当 $x$ 的值小于 $\mu_1$ 与 $\mu_2$ 的平均值时， $x$ 就划分为第一类；反之当 $x$ 的值大于 $\mu_1$ 与 $\mu_2$ 的平均值时，就划分为第二类。

如果我们有多余两个的类别的数目（ $K > 2$ ），我们须要计算 $p_1(x), \, p_2(x), \cdots, \, p_K(x)$ ，找到使 $p_k(x)$ 最大的那个 $k$ 。

参数估计

实际问题中，大多数情况下我们是不知道 $\mu_1, \mu_2, \cdots, \, \mu_K$ 以及 $\sigma^2$ 的，这就要求我们对 $\mu_1, \mu_2, \cdots, \, \mu_K$ 以及 $\sigma^2$ 作出估计。
我们用如下方法进行估计 [1]：
$\begin{aligned} \hat{\mu}_k &= \frac{1}{n_k} \sum_{i: y_i = k} x_i \\ \hat{\sigma}^2 &= \frac{1}{n - K} \sum_{k = 1}^K \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2 \end{aligned}$

这里 $n$ 是样本点的个数， $n_k$ 是第 $k$ 类的样本点的数目。

程序实现

假设 $K = 3$ ，我们的样本数据来自于三个不同的正态分布，如下图所示：
在这里插入图片描述
黑色的虚线和红色的虚线分别代表根据LDA 的线性判别公式得到的分类边界（decision boundary）。

我们根据三个不同的正态分布生成样本数据点。

num_points_each_category = 10
x1_sample = np.random.normal(mu_vec[0], 1, (num_points_each_category, 1)) 
x2_sample = np.random.normal(mu_vec[1], 1, (num_points_each_category, 1))
x3_sample = np.random.normal(mu_vec[2], 1, (num_points_each_category, 1))
x1_label = np.zeros((num_points_each_category, 1)) + 1 # generate the label value for category 1
x2_label = np.zeros((num_points_each_category, 1)) + 2 # generate the label value for category 2
x3_label = np.zeros((num_points_each_category, 1)) + 3 # generate the label value for category 3

# Make x_i of shape (num_points_each_category, 2), with x_i[:,0] being the data value 
# and x_i[:,1] the category label
x1 = np.hstack((x1_sample, x1_label))
x2 = np.hstack((x2_sample, x2_label))
x3 = np.hstack((x3_sample, x3_label))

# Combine them into one ndarray
sample = np.vstack((x1, x2, x3))

sample 的 shape 为 ( $m$ , 2)。这里 $m$ 是所有数据点的个数。样本点生成之后，我们可以根据一下程序对样本点进行分类。

class LDA:
    
    def __init__(self, num_categories, pi_vector, mu_vector, variance):
        """
        num_categories is the number of total categories. 
        vairance is the common distribution variance for all categories.
        pi_vector is the vector containing the prior distribution for
        all categories. len(pi_vector) = num_categories
        mu_vector is the vector containing the mean value for each
        category. len(mu_vector) = num_categories.
        """
        self.num_categories = num_categories
        self.variance = variance
        self.pi_vector = pi_vector
        self.mu_vector = mu_vector
        self.estimated_mu_vector = []
        self.estimated_variance = 0
    
    def discriminantFormula(self, x, pi_k, mu_k, var):
        """
        Calculate the value of the \delta_k(x) function 
        given x. The meaning of \delta_k(x) function is
        defined in the text. 
        pi_k is the prior distribution for the kth category.
        mu_k is the mean value of the kth category.
        """
        delta_k_at_x = x * mu_k / var + \
                np.log(pi_k) - (mu_k ** 2) / (2 * var)
        return delta_k_at_x
    
    def find_best_k(self, x, using_estimated_mu_and_var=True):
        """
        Find the best category that makes \delta_k(x) the largest.
        x is the value of the sample data point.
        """
        cur_max = -float('inf')
        candi = -1
        for i in range(self.num_categories):
            if using_estimated_mu_and_var:
                cur = self.discriminantFormula(x, self.pi_vector[i], 
                            self.estimated_mu_vector[i], self.estimated_variance)
            else:
                cur = self.discriminantFormula(x, self.pi_vector[i], 
                            self.mu_vector[i], self.variance)
            if cur > cur_max:
                candi = i
                cur_max = cur
        return candi + 1 # we add 1 because the index starts from 0
    
    def get_estimation_mu_variance(self, sample):
        """
        In reality we need to estimate the mu_vector and the common
        variance from the sample. sample has a dimension of (m, 2). 
        m is the total number of data points. sample[:,0] is the value
        of the data points, sample[:,1] is the category label for each 
        data point (the label starts from 1). 
        """
        m = sample.shape[0]
        mu_estimated = [0 for _ in range(self.num_categories)]
        variance_total = 0
        for i in range(self.num_categories):
            category_i_data = sample[sample[:,1] == (i + 1)]
            mu_estimated[i] = np.mean(category_i_data[:,0])
            variance_total += np.sum((category_i_data[:,0] - mu_estimated[i]) ** 2)
        variance_estimated = variance_total / (m - self.num_categories)
        self.estimated_mu_vector= mu_estimated
        self.estimated_variance = variance_estimated
        return (mu_estimated, variance_estimated)
    
    def calculate_accuracy(self, sample: 'np.ndarray', using_estimated_mu_and_var=True):
        """
        We calculate the accuracy of the LDA classification method. Here
        sample is an np.ndarray type with the dimension (m, 2). m is the 
        total number of points in sample. 
        """
        m = sample.shape[0]
        predicted = [-1 for _ in range(m)]
        for i in range(m):
            cur_predicted = self.find_best_k(sample[i, 0], using_estimated_mu_and_var)
            predicted[i] = cur_predicted
        return np.sum(np.array(predicted) == sample[:,1]) / m

对样本 sample 进行分析如下。

a = LDA(num_categories=3, pi_vector=[1/3, 1/3, 1.3], 
        mu_vector=mu_vec, variance=1)
estimated_mu_vector, estimated_variance = a.get_estimation_mu_variance(sample)
a.calculate_accuracy(sample, using_estimated_mu_and_var=True)

最终作图如下：

plt.figure(figsize=(10, 8))
plt.plot(x1[:,0], np.zeros(num_points_each_category), 's', 
        x2[:,0], np.zeros(num_points_each_category), 'v',
        x3[:,0], np.zeros(num_points_each_category), 'o', 
         markerfacecolor='none', markersize=20, markeredgewidth=2)
# plt.scatter(x2_sample, np.zeros(10), 'v', markerfacecolor='none', markersize=20, linewidth=8)
# plt.scatter(x3_sample, np.zeros(10), 'o', markerfacecolor='none', markersize=20, linewidth=8)
plt.vlines(1, -0.1, 0.1, color='red',linestyles='--', linewidth=4)
plt.vlines(-1, -0.1, 0.1, color='black',linestyles='--', linewidth=4)
plt.vlines(np.mean([estimated_mu_vector[1], estimated_mu_vector[2]]), -0.1, 0.1, color='red',
           linestyles='-', linewidth=4)
plt.vlines(np.mean([estimated_mu_vector[0], estimated_mu_vector[1]]), -0.1, 0.1, color='black',
           linestyles='-', linewidth=4)
plt.ylim(-0.1, 0.1)
plt.xlabel(r"x", fontsize=36)
#plt.ylabel("$y$", fontsize=36)
plt.xticks(fontsize=24)
plt.yticks([])