Gaussian Discriminant Analysis (GDA), with Python Code

Mathematical Background

Multivariate Normal Distribution

  The multivariate normal distribution extends the normal distribution to vector-valued random variables. Its parameters are a mean vector $\mu \in \mathbb{R}^{n}$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, where $n$ is the dimension of the random vector and $\Sigma$ is symmetric positive definite. Its probability density function is:

$$P(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
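To make the formula concrete, here is a minimal NumPy sketch that evaluates the density at a single point. The function name `mvn_pdf` is my own choice for illustration, and the sketch assumes `sigma` is symmetric positive definite, as the text requires.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at a single point x (1-D array of length n)."""
    n = mu.shape[0]
    diff = x - mu
    # Solve sigma * z = diff instead of forming the inverse explicitly
    maha = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm_const

# Standard normal in two dimensions, evaluated at its mean
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))  # 1 / (2*pi), about 0.159
```

For high-dimensional data the determinant and the exponential can underflow, so working with the logarithm of the density is usually safer.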

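Covariance Matrix

  A sample's feature vector usually has several attributes, and we want to analyze the linear relationships between them. Covariance and the correlation coefficient both measure the linear relationship between random variables. Since the true distribution is unknown, they can only be estimated from samples.
  Suppose the sample set is written as a matrix $X=[x_1,x_2,x_3,\dots,x_m]^{T}$, where $m$ is the number of samples and $x_i$ is the feature vector of the $i$-th sample. The sample covariance matrix is computed as:

$$\Sigma=\frac{1}{m-1}\sum_{j=1}^{m}(x_j-\mu)(x_j-\mu)^{T} \tag{1}$$

where $\mu=\frac{1}{m}\sum_{i=1}^{m}x_i$ is the sample mean.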
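Equation (1) can be checked numerically. The sketch below is only an illustration: `sample_covariance` is a name I made up, and the small matrix `X` is invented data with one sample per row.

```python
import numpy as np

def sample_covariance(X):
    """Unbiased sample covariance of X, whose rows are samples."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    # Sum of outer products (x_j - mu)(x_j - mu)^T, written as one matrix product
    return centered.T @ centered / (m - 1)

X = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]])
cov = sample_covariance(X)
# np.cov treats rows as variables by default, so pass rowvar=False
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
```

`np.cov` divides by `m - 1` by default, which matches equation (1).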

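The GDA Model

  Gaussian discriminant analysis assumes that the label $y$ follows a Bernoulli distribution and that the class-conditional distribution $P(x|y)$ is multivariate normal. The corresponding distributions are:

$$\begin{align}
P(y) &= \phi^{y}(1-\phi)^{1-y} \tag{2}\\
P(x|y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma_0|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^{T}\Sigma_0^{-1}(x-\mu_0)\right) \tag{3}\\
P(x|y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma_1|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^{T}\Sigma_1^{-1}(x-\mu_1)\right) \tag{4}
\end{align}$$

In the standard GDA formulation both classes share a single covariance matrix, $\Sigma_0=\Sigma_1=\Sigma$. The likelihood and the parameter estimates in the rest of this section use that shared $\Sigma$.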
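Because GDA is a generative model, one way to read equations (2) to (4) is as a recipe for producing data: first draw the label, then draw the features from that class's Gaussian. The sketch below illustrates this. The function name `sample_gda` and every parameter value are assumptions made up for the example.

```python
import numpy as np

def sample_gda(m, phi, mu0, mu1, sigma0, sigma1, seed=0):
    """Draw m labelled points from the GDA generative model."""
    rng = np.random.default_rng(seed)
    # y ~ Bernoulli(phi), as in equation (2)
    y = (rng.random(m) < phi).astype(int)
    x = np.empty((m, len(mu0)))
    for i in range(m):
        # x | y ~ N(mu_y, sigma_y), as in equations (3) and (4)
        mean, cov = (mu1, sigma1) if y[i] == 1 else (mu0, sigma0)
        x[i] = rng.multivariate_normal(mean, cov)
    return x, y

x, y = sample_gda(1000, 0.3, np.array([0.0, 0.0]), np.array([3.0, 3.0]),
                  np.eye(2), np.eye(2))
print(x.shape, y.mean())
```

The fraction of positive labels should come out close to the chosen `phi`, and the positive samples should cluster around `mu1`.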

The log-likelihood of the data set is then:

$$\begin{align}
L(\phi,\mu_0,\mu_1,\Sigma) &= \log\prod_{i=1}^{m}P(x_i,y_i;\phi,\mu_0,\mu_1,\Sigma) \tag{5}\\
&= \log\prod_{i=1}^{m}P(x_i|y_i;\mu_0,\mu_1,\Sigma)\,P(y_i;\phi) \tag{6}
\end{align}$$
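Expanding the product makes the maximization easier to follow. The step below is a sketch that assumes the shared covariance $\Sigma$ used in the likelihood above. It substitutes equation (2) and the Gaussian density with mean $\mu_{y_i}$:

$$\begin{aligned}
L &= \sum_{i=1}^{m}\left[\log P(x_i|y_i;\mu_0,\mu_1,\Sigma)+\log P(y_i;\phi)\right]\\
&= \sum_{i=1}^{m}\left[-\frac{n}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma|-\frac{1}{2}(x_i-\mu_{y_i})^{T}\Sigma^{-1}(x_i-\mu_{y_i})+y_i\log\phi+(1-y_i)\log(1-\phi)\right]
\end{aligned}$$

Only the last two terms depend on $\phi$. Setting $\frac{\partial L}{\partial\phi}=\sum_{i}\frac{y_i}{\phi}-\sum_{i}\frac{1-y_i}{1-\phi}=0$ gives $\phi=\frac{1}{m}\sum_{i}y_i$, the fraction of samples with $y_i=1$.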

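Maximizing the log-likelihood yields the maximum likelihood estimates of the parameters:

$$\begin{align}
\phi &= \frac{1}{m}\sum_{i=1}^{m}I\{y_i=1\} \tag{7}\\
\mu_0 &= \frac{\sum_{i=1}^{m}I\{y_i=0\}\,x_i}{\sum_{i=1}^{m}I\{y_i=0\}} \tag{8}\\
\Sigma &= \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_{y_i})(x_i-\mu_{y_i})^{T} \tag{9}
\end{align}$$

The estimate of $\mu_1$ has the same form as (8), with the indicator $I\{y_i=1\}$ in place of $I\{y_i=0\}$.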
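The estimates above translate almost line by line into NumPy. The following is a minimal sketch for the shared-covariance case. The function name `fit_gda_shared` and the four-point toy data set are assumptions for illustration only.

```python
import numpy as np

def fit_gda_shared(X, y):
    """MLE of (phi, mu0, mu1, sigma) for GDA with a shared covariance matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m = X.shape[0]
    phi = np.mean(y == 1)                 # equation (7)
    mu0 = X[y == 0].mean(axis=0)          # equation (8)
    mu1 = X[y == 1].mean(axis=0)          # same form with I{y_i = 1}
    # Subtract from every sample the mean of its own class
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centered.T @ centered / m     # equation (9): 1/m, not 1/(m-1)
    return phi, mu0, mu1, sigma

X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 4.0], [6.0, 4.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, sigma = fit_gda_shared(X, y)
print(phi, mu0, mu1)
print(sigma)
```

Note that the maximum likelihood estimate divides by `m`, whereas the sample covariance in equation (1) divides by `m - 1`.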

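If the diagonal entries of a class's covariance matrix are small, the samples of that class are tightly concentrated; if they are large, the samples are more spread out. If the two classes share the same covariance matrix, the decision boundary is a hyperplane. It passes through the midpoint of the two class means $\mu_0,\mu_1$ when the class priors are equal, and it is the perpendicular bisector of the segment joining them when, in addition, $\Sigma$ is a multiple of the identity matrix. Below is an example of a two-dimensional GDA model (from Andrew Ng's machine learning course):

[Figure: a two-dimensional GDA example]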
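The shared-covariance case can also be checked numerically: the log-odds of the two classes reduce to a linear function $w^{T}x+b$, so the posterior is a logistic function of $x$. The sketch below is an illustration only. The name `shared_cov_posterior` and the parameter values are made up.

```python
import numpy as np

def shared_cov_posterior(x, phi, mu0, mu1, sigma):
    """P(y=1 | x) for GDA with a shared covariance matrix."""
    sigma_inv = np.linalg.inv(sigma)
    # Normal vector of the separating hyperplane
    w = sigma_inv @ (mu1 - mu0)
    # Offset: combines the class means and the log prior odds
    b = -0.5 * (mu1 @ sigma_inv @ mu1 - mu0 @ sigma_inv @ mu0) + np.log(phi / (1 - phi))
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
midpoint = (mu0 + mu1) / 2
# With equal priors the midpoint lies on the boundary (posterior about 0.5)
print(shared_cov_posterior(midpoint, 0.5, mu0, mu1, sigma))
```

With unequal priors the posterior at the midpoint equals the prior `phi`, so the boundary shifts toward the less likely class.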

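Code

After working through Andrew Ng's machine learning course, I wrote a Gaussian discriminant analysis implementation in Python. Note that it estimates a separate covariance matrix for each class, as in equations (3) and (4). If you find any problems, please leave a comment: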

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

class GDA(object):
    """Gaussian discriminant analysis with one covariance matrix per class."""

    def __init__(self, train_data, train_target):
        train_data = np.asarray(train_data, dtype=float)
        train_target = np.asarray(train_target)
        self.X = train_data
        self.Y = train_target
        self.data_num = train_data.shape[0]
        self.feature_size = train_data.shape[1]
        # Class priors P(y=1) and P(y=0)
        self.Ppos = np.mean(train_target == 1)
        self.Pneg = 1 - self.Ppos
        # Split the samples by label
        posData = train_data[train_target == 1]
        negData = train_data[train_target != 1]
        # Per-class mean vectors
        self.Ave_pos = posData.mean(axis=0)
        self.Ave_neg = negData.mean(axis=0)
        # Per-class feature variances (kept for inspection; predict does not use them)
        self.Var_pos = posData.var(axis=0)
        self.Var_neg = negData.var(axis=0)
        # Per-class covariance matrices; np.cov expects variables in rows
        # and divides by (m - 1), as in equation (1)
        self.Cov_pos = np.cov(posData.T)
        self.Cov_neg = np.cov(negData.T)

    def _log_joint(self, test_data, prior, mean, cov):
        # log P(y) + log P(x|y) for every row of test_data.
        # Working in log space avoids underflow of the density in high dimensions.
        diff = test_data - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.sum(np.dot(diff, np.linalg.inv(cov)) * diff, axis=1)
        # The exponent of 2*pi uses the feature dimension n, not the sample count
        log_density = -0.5 * (self.feature_size * np.log(2 * np.pi) + logdet + maha)
        return np.log(prior) + log_density

    def predict(self, test_data):
        test_data = np.asarray(test_data, dtype=float)
        log_pos = self._log_joint(test_data, self.Ppos, self.Ave_pos, self.Cov_pos)
        log_neg = self._log_joint(test_data, self.Pneg, self.Ave_neg, self.Cov_neg)
        # P(y=1|x) >= 0.5 is equivalent to P(x|y=1)P(y=1) >= P(x|y=0)P(y=0)
        return [1 if p >= q else 0 for p, q in zip(log_pos, log_neg)]

    def evaluate(self, predict_target, test_target):
        err = 0
        for i in range(len(predict_target)):
            if predict_target[i] != test_target[i]:
                err += 1
        return err/len(predict_target)

if __name__ == "__main__":
    cancer = load_breast_cancer()
    # Hold out 10% of the samples for testing
    train_data, test_data, train_target, test_target = train_test_split(
        cancer.data, cancer.target, test_size=0.1, random_state=0)

    gda = GDA(train_data, train_target)
    predict = gda.predict(test_data)
    print('Accuracy:', 1 - gda.evaluate(predict, test_target))
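
Since the class above fits a separate covariance matrix per class, it is closest in spirit to quadratic discriminant analysis. As a rough sanity check, one could compare its accuracy with scikit-learn's `QuadraticDiscriminantAnalysis` on the same split. This is only a sketch for comparison: the two implementations may differ in numerical details, so the accuracies need not match exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.1, random_state=0)

# QDA fits one Gaussian with its own covariance matrix per class
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
qda_accuracy = qda.score(X_test, y_test)
print('QDA accuracy:', qda_accuracy)
```

scikit-learn may warn about collinear features on this data set. That is a property of the data rather than an error in the call.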