Basic mathematical background
Multivariate normal distribution
The multivariate normal distribution is the extension of the normal distribution to multi-dimensional variables. Its parameters are a mean vector $\mu$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, where $n$ is the length of the variable vector and $\Sigma$ is a symmetric positive-definite matrix. The probability density function of the multivariate normal distribution is:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
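As a sanity check, the density formula above can be evaluated directly with NumPy and compared against scipy.stats.multivariate_normal (the mean, covariance, and query point below are arbitrary illustration values, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])            # mean vector μ (arbitrary example)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])       # symmetric positive-definite Σ
x = np.array([0.5, 0.5])             # query point

n = len(mu)
diff = x - mu
# density: (2π)^{-n/2} |Σ|^{-1/2} exp(-½ (x-μ)ᵀ Σ⁻¹ (x-μ))
pdf = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
      / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

print(pdf)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```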
Covariance matrix
A sample's feature vector usually has several attributes, and we want to analyse the linear relationships among them. Covariance and the correlation coefficient measure the linear relationship between random variables; since the true distribution is unknown, they can only be estimated from samples.
Suppose the sample set is written as a matrix $X = [x_1, x_2, x_3, \dots, x_m]^T$, where $x_i$ is the feature vector of the $i$-th sample. The sample covariance matrix is computed as:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^T$$
In the formula, $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$.
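The $\frac{1}{m}$ formula above can be checked against np.cov on random data (the data below is arbitrary); note that np.cov defaults to the unbiased $\frac{1}{m-1}$ normalisation, so bias=True is needed to match:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # m = 100 samples, n = 3 features (arbitrary)

m = X.shape[0]
mu = X.mean(axis=0)             # μ = (1/m) Σ x_i
# Σ = (1/m) Σ (x_i - μ)(x_i - μ)ᵀ — the biased (maximum-likelihood) estimator
Sigma = (X - mu).T @ (X - mu) / m

# np.cov expects variables in rows, hence X.T; bias=True selects the 1/m form
print(np.allclose(Sigma, np.cov(X.T, bias=True)))  # True
```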
The GDA modeling process
The basic assumption of Gaussian discriminant analysis is that the target value $y$ follows a Bernoulli distribution and that the conditional distribution of $x$ given each class is normal. Their probability densities are:

$$p(y) = \phi^y (1 - \phi)^{1 - y}$$

$$p(x \mid y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \quad k \in \{0, 1\}$$
The likelihood function of the data set is therefore:

$$L(\phi, \mu_0, \mu_1, \Sigma_0, \Sigma_1) = \prod_{i=1}^{m} p(x_i \mid y_i)\, p(y_i)$$
Maximising this likelihood yields the maximum-likelihood estimates of the parameters:

$$\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y_i = 1\}$$

$$\mu_k = \frac{\sum_{i=1}^{m} 1\{y_i = k\}\, x_i}{\sum_{i=1}^{m} 1\{y_i = k\}}, \qquad \Sigma_k = \frac{\sum_{i=1}^{m} 1\{y_i = k\}\, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{m} 1\{y_i = k\}}, \quad k \in \{0, 1\}$$
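These estimates are just per-class empirical frequencies, means, and covariances, which makes them easy to compute directly. A minimal sketch on synthetic two-dimensional data (all numbers are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
# toy labels and class-conditional Gaussian samples (arbitrary parameters)
y = (rng.random(m) > 0.5).astype(int)
X = np.where(y[:, None] == 1,
             rng.normal(2.0, 1.0, size=(m, 2)),
             rng.normal(-2.0, 1.0, size=(m, 2)))

phi = y.mean()                      # φ = (1/m) Σ 1{y_i = 1}
mu0 = X[y == 0].mean(axis=0)        # μ0: mean of the negative class
mu1 = X[y == 1].mean(axis=0)        # μ1: mean of the positive class
# Σ_k: covariance of each class around its own mean (the 1/m_k MLE form)
Sigma0 = (X[y == 0] - mu0).T @ (X[y == 0] - mu0) / (y == 0).sum()
Sigma1 = (X[y == 1] - mu1).T @ (X[y == 1] - mu1) / (y == 1).sum()

print(phi, mu0, mu1)
```

With enough samples the estimates land close to the generating parameters ($\phi \approx 0.5$, $\mu_1 \approx (2, 2)$, $\mu_0 \approx (-2, -2)$).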
If the diagonal elements of a class's covariance matrix are small, that class's sample points are tightly clustered; if they are large, the points are spread out. When the two classes share the same covariance matrix, the decision hyperplane perpendicularly bisects the segment joining the two distribution centres $\mu_0$ and $\mu_1$ (exactly so when the priors are equal and $\Sigma$ is isotropic). Andrew Ng's machine learning course illustrates this with a two-dimensional GDA example.
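The linearity of the boundary under a shared covariance can be verified numerically: with one $\Sigma$ for both classes, the log posterior ratio reduces to $w^T x + b$ with $w = \Sigma^{-1}(\mu_1 - \mu_0)$. A small check, using made-up parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# hypothetical parameters for two classes sharing one covariance matrix
mu0 = np.array([-1.0, 0.0])
mu1 = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
phi = 0.5

Sigma_inv = np.linalg.inv(Sigma)
# with a shared Σ the quadratic terms in x cancel, leaving wᵀx + b
w = Sigma_inv @ (mu1 - mu0)
b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + np.log(phi / (1 - phi))

x = np.array([0.3, -0.7])
log_ratio = (np.log(multivariate_normal(mu1, Sigma).pdf(x) * phi)
             - np.log(multivariate_normal(mu0, Sigma).pdf(x) * (1 - phi)))
print(np.isclose(log_ratio, w @ x + b))  # True
```

With two different covariance matrices the cancelled quadratic terms survive, and the boundary becomes a quadric rather than a hyperplane.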
Code
After finishing Andrew Ng's machine learning course I wrote a Gaussian discriminant algorithm in Python myself; please leave a comment if you spot a problem:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


class GDA(object):
    def __init__(self, train_data, train_target):
        self.X = train_data
        self.Y = train_target
        self.data_num = train_data.shape[0]
        self.feature_size = train_data.shape[1]
        # class priors P(y=1) and P(y=0)
        self.Ppos = train_target.sum() / self.data_num
        self.Pneg = 1 - self.Ppos
        posData = train_data[train_target == 1]
        negData = train_data[train_target == 0]
        # class means: maximum-likelihood estimates of μ1 and μ0
        self.Ave_pos = posData.mean(axis=0)
        self.Ave_neg = negData.mean(axis=0)
        # per-feature variances of each class (diagnostic only; predict uses the full covariances)
        self.Var_pos = posData.var(axis=0)
        self.Var_neg = negData.var(axis=0)
        # per-class covariance matrices Σ1 and Σ0
        self.Cov_pos = np.cov(posData.T)
        self.Cov_neg = np.cov(negData.T)

    def predict(self, test_data):
        predict_target = []
        # work in log space: on the unscaled 30-feature data the covariance
        # determinants overflow a float, so use slogdet rather than det
        _, logdet_pos = np.linalg.slogdet(self.Cov_pos)
        _, logdet_neg = np.linalg.slogdet(self.Cov_neg)
        Cov_pos_inv = np.linalg.inv(self.Cov_pos)
        Cov_neg_inv = np.linalg.inv(self.Cov_neg)
        for i in range(test_data.shape[0]):
            diff_pos = test_data[i] - self.Ave_pos
            diff_neg = test_data[i] - self.Ave_neg
            # log N(x; μ, Σ) up to the shared -(n/2)·log(2π) constant
            log_P_pos = -diff_pos @ Cov_pos_inv @ diff_pos / 2 - logdet_pos / 2
            log_P_neg = -diff_neg @ Cov_neg_inv @ diff_neg / 2 - logdet_neg / 2
            # Bayes' rule: P(y=1|x) >= 0.5 iff P(x|y=1)P(y=1) >= P(x|y=0)P(y=0)
            if log_P_pos + np.log(self.Ppos) >= log_P_neg + np.log(self.Pneg):
                predict_target.append(1)
            else:
                predict_target.append(0)
        return predict_target

    def evaluate(self, predict_target, test_target):
        err = 0
        for i in range(len(predict_target)):
            if predict_target[i] != test_target[i]:
                err += 1
        return err / len(predict_target)


if __name__ == "__main__":
    cancer = load_breast_cancer()
    train_data, test_data, train_target, test_target = train_test_split(
        cancer.data, cancer.target, test_size=0.1, random_state=0)
    gda = GDA(train_data, train_target)
    predict = gda.predict(test_data)
    print('Accuracy:', 1 - gda.evaluate(predict, test_target))