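Continuous Bayesian Classification
Every feature used below is a continuous variable described by a probability density function, hence the name continuous Bayesian classification.
Example
Statistics on a group of people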
Sex | Height (ft) | Weight (lb) | Foot size (in) |
---|---|---|---|
male | 6 | 180 | 12 |
male | 5.92 | 190 | 11 |
male | 5.58 | 170 | 12 |
male | 5.92 | 165 | 10 |
female | 5 | 100 | 6 |
female | 5.5 | 150 | 8 |
female | 5.42 | 130 | 7 |
female | 5.75 | 150 | 9 |
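Given a person who is 6 feet tall, weighs 130 pounds, and has a foot size of 8 inches, is this person male or female?
Approach
Normal distribution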
Height, weight, and foot size are all continuous variables, and we assume that within each sex each of them follows a normal distribution. If a random variable $X \sim N(\mu, \sigma^2)$, its probability density function is:
$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) \tag{1}$$
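
Formula (1) translates directly into code. The following is a minimal sketch using only the standard library; the function name `normal_pdf` is an illustrative choice, not something defined elsewhere in this post. Note that it takes the variance σ², not the standard deviation.

```python
import math


def normal_pdf(x, mean, var):
    """Evaluate the N(mean, var) density of formula (1) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Peak of the standard normal density, 1 / sqrt(2 * pi).
print(normal_pdf(0.0, 0.0, 1.0))
# With a small variance the value exceeds 1: it is a density, not a probability.
print(normal_pdf(0.0, 0.0, 0.01))
```

Because the result is a density, values above 1 are legitimate; only the integral over all x has to equal 1.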
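Compute the mean, variance, and range (maximum minus minimum) of height, weight, and foot size for each sex:
Sex | Feature | Mean | Variance (unbiased, n - 1) | Range (max - min) |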
---|---|---|---|---|
male | height | 5.855 | 0.0350 | 0.42 |
female | height | 5.4175 | 0.0972 | 0.75 |
male | weight | 176.25 | 122.917 | 25 |
female | weight | 132.5 | 558.333 | 50 |
male | foot | 11.25 | 0.917 | 2 |
female | foot | 7.5 | 1.667 | 3 |
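
The statistics in this table can be reproduced with a few lines of NumPy. The sketch below assumes the eight samples are typed in by hand as two arrays (`male`, `female`) and uses a hypothetical helper `summarize`; the variance is computed with `ddof=1` (division by n - 1), which is what the listed values correspond to.

```python
import numpy as np

# Columns: height (ft), weight (lb), foot size (in), copied from the data table.
male = np.array([[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
female = np.array([[5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])


def summarize(samples):
    """Return per-column mean, sample variance (n - 1), and max-min range."""
    mean = samples.mean(axis=0)
    var = samples.var(axis=0, ddof=1)
    value_range = samples.max(axis=0) - samples.min(axis=0)
    return mean, var, value_range


for name, samples in (("male", male), ("female", female)):
    mean, var, value_range = summarize(samples)
    print(name, "mean:", mean, "variance:", var, "range:", value_range)
```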
Prediction
Three feature values are known and the sex has to be predicted, so we need to work out which of the two cases, male or female, has the higher probability. By Bayes' formula:
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
When computing $P(\text{sex} \mid \text{height}, \text{weight}, \text{foot})$ for each sex, we have $A = \text{sex}$ and $B = (\text{height}, \text{weight}, \text{foot})$. The three features are assumed to be conditionally independent given the sex, and $P(B)$ is the same constant for both sexes, so only the numerator of Bayes' formula needs to be compared. That is, we compare the following two expressions:
$$P(\text{height} \mid \text{male}) \cdot P(\text{weight} \mid \text{male}) \cdot P(\text{foot} \mid \text{male}) \cdot P(\text{male})$$
$$P(\text{height} \mid \text{female}) \cdot P(\text{weight} \mid \text{female}) \cdot P(\text{foot} \mid \text{female}) \cdot P(\text{female})$$
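Since we have assumed that the three features are normally distributed, every conditional term above can be evaluated with the density formula (1) and the statistics already computed. These terms are density values rather than probabilities, so they may exceed 1. Comparing the resulting male and female scores gives the prediction.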
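
The whole calculation can be sketched end to end as follows. The arrays are typed in from the data table, the helper names (`gaussian_pdf`, `posterior_numerator`) are hypothetical, and the variance uses `ddof=1` to match the statistics table; treat it as an illustration of the comparison rather than a finished classifier.

```python
import numpy as np

# Columns: height (ft), weight (lb), foot size (in), copied from the data table.
male = np.array([[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
female = np.array([[5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
query = np.array([6.0, 130.0, 8.0])  # the person to classify


def gaussian_pdf(x, mean, var):
    """Normal density from formula (1), evaluated element-wise."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)


def posterior_numerator(samples, x, prior):
    """Prior times the product of the per-feature class-conditional densities."""
    mean = samples.mean(axis=0)
    var = samples.var(axis=0, ddof=1)  # sample variance, divides by n - 1
    return prior * np.prod(gaussian_pdf(x, mean, var))


prior_male = len(male) / (len(male) + len(female))
score_male = posterior_numerator(male, query, prior_male)
score_female = posterior_numerator(female, query, 1 - prior_male)

# Dividing by the sum of both numerators recovers the actual posteriors,
# because P(B) is exactly that sum when there are only two classes.
evidence = score_male + score_female
post_male = score_male / evidence
post_female = score_female / evidence

print("male score:", score_male, "female score:", score_female)
print("prediction:", "male" if score_male > score_female else "female")
```

With these eight samples the male score is on the order of 6e-09 and the female score on the order of 5e-04, so the person is classified as female.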
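Other notes
The variance has to be computed with ddof=1, which gives the unbiased sample variance (dividing by n - 1). NumPy's default, ddof=0, gives the biased estimate (dividing by n).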
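
The effect of `ddof` is easy to check on the four male heights. A small sketch, assuming NumPy's `np.var`:

```python
import numpy as np

male_heights = np.array([6.00, 5.92, 5.58, 5.92])

biased = np.var(male_heights)            # ddof=0 (default): divide by n
unbiased = np.var(male_heights, ddof=1)  # ddof=1: divide by n - 1

print(f"ddof=0: {biased:.6f}")
print(f"ddof=1: {unbiased:.6f}")
```

The two results are roughly 0.0263 and 0.0350; only the second matches the male height variance in the statistics table.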
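Possible improvements
Standardization
The ranges (max - min) of the different features differ greatly, which makes them hard to compare directly. To put the features on a comparable scale, all values can be standardized as $Y = \frac{X - \mu}{\sigma} \sim N(0, 1)$, so that each feature follows the standard normal distribution.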
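
As a rough sketch of this idea, the snippet below standardizes each feature with the pooled mean and standard deviation of all eight samples and then recomputes the female/male likelihood ratio. Using pooled statistics is an assumption, since the text does not say which μ and σ to use, and the helper names are hypothetical. With per-class Gaussian likelihoods, rescaling a feature multiplies both classes' densities by the same factor, so under these assumptions the ratio, and therefore the prediction, should come out unchanged; what changes is that all features end up on comparable scales.

```python
import numpy as np

# Columns: height (ft), weight (lb), foot size (in), copied from the data table.
male = np.array([[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
female = np.array([[5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
query = np.array([6.0, 130.0, 8.0])


def gaussian_pdf(x, mean, var):
    """Normal density, evaluated element-wise."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)


def class_likelihood(samples, x):
    """Product of per-feature densities using the class mean and sample variance."""
    return np.prod(gaussian_pdf(x, samples.mean(axis=0), samples.var(axis=0, ddof=1)))


# Pooled statistics over all samples (assumption: one mu and sigma per feature).
pooled = np.vstack([male, female])
mu = pooled.mean(axis=0)
sigma = pooled.std(axis=0, ddof=1)


def standardize(values):
    """Map raw feature values to z-scores with the pooled statistics."""
    return (values - mu) / sigma


ratio_raw = class_likelihood(female, query) / class_likelihood(male, query)
ratio_std = (class_likelihood(standardize(female), standardize(query))
             / class_likelihood(standardize(male), standardize(query)))

print("female/male ratio, raw features:         ", ratio_raw)
print("female/male ratio, standardized features:", ratio_std)
```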
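Code
The code expects an Excel file, `data.xls`, containing the data from the table above (one header row followed by the eight samples). The code is as follows: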
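```python
import math  # only used by the optional standardization block that is commented out below

import numpy as np
import pandas as pd

DATA_FILENAME = 'data.xls'

# Labels accepted in the sex column: the original Chinese values or their English equivalents.
MALE_LABELS = ('男', 'male')
FEMALE_LABELS = ('女', 'female')


def read_excel_data(filename):
    '''
    Read the contents of the Excel file.
    :param filename: path to the Excel file
    :return: 2-D array with columns sex, height, weight, foot size
    '''
    sheet = pd.read_excel(filename)  # reads the first sheet by default
    data = sheet.values
    return data


def get_math_info(X):
    '''
    Compute the per-sex statistics.
    :param X: 2-D array returned by read_excel_data
    :return: male mean, female mean, male variance, female variance, male count, female count
    '''
    # Select the rows of each sex with a boolean mask and convert the features to float.
    male_mask = np.array([label in MALE_LABELS for label in X[:, 0]])
    female_mask = np.array([label in FEMALE_LABELS for label in X[:, 0]])
    X_male = X[male_mask, 1:].astype(float)
    X_female = X[female_mask, 1:].astype(float)

    mean_male = np.mean(X_male, axis=0)
    var_male = np.var(X_male, axis=0, ddof=1)  # ddof=1: sample variance, divides by n - 1
    mean_female = np.mean(X_female, axis=0)
    var_female = np.var(X_female, axis=0, ddof=1)
    print(mean_male, mean_female)
    print(var_male, var_female)

    # Optional per-column standardization (disabled):
    # for col in range(np.shape(X_male)[1]):
    #     if var_male[col]:
    #         X_male[:, col] = (X_male[:, col] - mean_male[col]) / math.sqrt(var_male[col])
    #
    # for col in range(np.shape(X_female)[1]):
    #     if var_female[col]:
    #         X_female[:, col] = (X_female[:, col] - mean_female[col]) / math.sqrt(var_female[col])

    return mean_male, mean_female, var_male, var_female, X_male.shape[0], X_female.shape[0]


def normal_probability(x, mean, var):
    '''
    Evaluate the normal probability density at x.
    :param x: input value
    :param mean: mean of the normal distribution
    :param var: variance of the normal distribution
    :return: density at x, used as the class-conditional likelihood
    '''
    return np.exp(- ((x - mean) ** 2 / (2 * var))) / np.sqrt(2 * np.pi * var)


def calculate_probability(height, weight, foot, mean_arr, var_arr, sex_probability):
    '''
    Compute the numerator of Bayes' formula.
    :param height: height in feet
    :param weight: weight in pounds
    :param foot: foot size in inches
    :param mean_arr: per-feature means for one sex (height, weight, foot)
    :param var_arr: per-feature variances for one sex (height, weight, foot)
    :param sex_probability: prior probability of that sex
    :return: product of the three likelihoods and the prior
    '''
    probability = normal_probability(height, mean_arr[0], var_arr[0]) * \
        normal_probability(weight, mean_arr[1], var_arr[1]) * \
        normal_probability(foot, mean_arr[2], var_arr[2]) * sex_probability
    return probability


def prediction(X, height, weight, foot):
    mean_male, mean_female, var_male, var_female, num_male, num_female = get_math_info(X)
    # Priors are estimated from the class frequencies in the data.
    pro_male = num_male / (num_male + num_female)
    pro_female = 1 - pro_male
    male_probability = calculate_probability(height, weight, foot, mean_male, var_male, pro_male)
    female_probability = calculate_probability(height, weight, foot, mean_female, var_female, pro_female)
    print("Male score:", male_probability, ", female score:", female_probability)
    print("Female/male ratio:", female_probability / male_probability)
    if male_probability > female_probability:
        print("Predicted: male")
    else:
        print("Predicted: female")


def main():
    X = read_excel_data(DATA_FILENAME)
    prediction(X, 6, 130, 8)


if __name__ == '__main__':
    main()
```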
References
- http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html
- https://blog.csdn.net/u013719780/article/details/78388056