Continuous Bayesian Classification

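All of the features below are continuous variables described by probability density functions, hence the name continuous Bayesian classification.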

Example

Statistics for a set of human characteristics

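| Sex | Height (ft) | Weight (lb) | Foot size (in) |
|---|---|---|---|
| male | 6 | 180 | 12 |
| male | 5.92 | 190 | 11 |
| male | 5.58 | 170 | 12 |
| male | 5.92 | 165 | 10 |
| female | 5 | 100 | 6 |
| female | 5.5 | 150 | 8 |
| female | 5.42 | 130 | 7 |
| female | 5.75 | 150 | 9 |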

A person is known to be 6 feet tall, to weigh 130 pounds, and to have a foot size of 8 inches. Is this person male or female?

Approach

Normal distribution

Height, weight, and foot size are all continuous variables, and each is assumed to follow a normal distribution within each sex. If a random variable $X \sim N(\mu, \sigma^2)$, its probability density function is:

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) \tag{1}$$
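As a quick check of formula (1), the sketch below evaluates the density at a height of 6 feet, using the mean and sample variance of the four male heights from the data table. The function name `normal_pdf` is only illustrative and is not part of the original code.

```python
import math
import statistics


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x, following formula (1)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


male_heights = [6, 5.92, 5.58, 5.92]      # male rows of the data table
mu = statistics.mean(male_heights)        # 5.855
var = statistics.variance(male_heights)   # sample variance (divides by n - 1)

density = normal_pdf(6, mu, var)
print(round(density, 4))  # roughly 1.58
```

The result is larger than 1, which is fine: formula (1) gives a density, not a probability, and only the relative size of the values matters for the comparison later on.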
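Compute the mean, variance, and range (max minus min) of height, weight, and foot size for males and females

| Sex | Feature | Mean | Variance (sample, divides by n − 1) | Range (max − min) |
|---|---|---|---|---|
| male | height | 5.855 | 0.0350 | 0.42 |
| female | height | 5.4175 | 0.0972 | 0.75 |
| male | weight | 176.25 | 122.917 | 25 |
| female | weight | 132.5 | 558.333 | 50 |
| male | foot | 11.25 | 0.917 | 2 |
| female | foot | 7.5 | 1.667 | 3 |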
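The statistics in the table can be reproduced with a few NumPy calls. This is a minimal sketch with the data typed in directly; the helper name `summarize` is an assumption made for illustration.

```python
import numpy as np

# Feature columns: height (ft), weight (lb), foot size (in)
male = np.array([[6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
female = np.array([[5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])


def summarize(samples):
    """Return per-feature mean, sample variance (n - 1), and range."""
    mean = samples.mean(axis=0)
    var = samples.var(axis=0, ddof=1)
    value_range = samples.max(axis=0) - samples.min(axis=0)
    return mean, var, value_range


for name, samples in (("male", male), ("female", female)):
    mean, var, value_range = summarize(samples)
    print(name, np.round(mean, 4), np.round(var, 4), np.round(value_range, 2))
```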

Prediction

We know three feature values and need to predict the sex, so we compare which of the two cases, male or female, has the higher probability. By Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

When computing $P(\text{sex} \mid \text{height}, \text{weight}, \text{foot})$ for each sex, we have $A = \text{sex}$ and $B = (\text{height}, \text{weight}, \text{foot})$. The three features are assumed to be conditionally independent given the sex, and $P(B)$ is the same constant for both sexes, so only the numerator of Bayes' formula needs to be compared. That is, we compare the following two expressions:

$$P(\text{height} \mid \text{male}) \cdot P(\text{weight} \mid \text{male}) \cdot P(\text{foot} \mid \text{male}) \cdot P(\text{male})$$

$$P(\text{height} \mid \text{female}) \cdot P(\text{weight} \mid \text{female}) \cdot P(\text{foot} \mid \text{female}) \cdot P(\text{female})$$

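Since the three variables are assumed to be normally distributed, each conditional term can be evaluated with the density formula (1) and the known statistics. Strictly speaking these are density values rather than probabilities, which does not affect the comparison. Comparing the male and female results gives the prediction.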
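The comparison can be worked through numerically as follows. This is a sketch rather than the article's own code: the means and variances are taken from the statistics table (written with a few more digits), the priors are 0.5 because four of the eight samples fall in each class, and names such as `normal_pdf` and `numerators` are illustrative.

```python
import math


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x, following formula (1)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


# (mean, variance) per feature, from the statistics table
stats = {
    "male": {"height": (5.855, 0.035033), "weight": (176.25, 122.917), "foot": (11.25, 0.916667)},
    "female": {"height": (5.4175, 0.097225), "weight": (132.5, 558.333), "foot": (7.5, 1.666667)},
}
prior = {"male": 0.5, "female": 0.5}
query = {"height": 6, "weight": 130, "foot": 8}

numerators = {}
for sex, params in stats.items():
    score = prior[sex]
    for feature, (mu, var) in params.items():
        score *= normal_pdf(query[feature], mu, var)
    numerators[sex] = score

# Dividing by the sum of the numerators recovers the posterior under the model.
evidence = sum(numerators.values())
posterior = {sex: score / evidence for sex, score in numerators.items()}
prediction = max(numerators, key=numerators.get)
print(numerators, prediction)
```

The male numerator comes out on the order of 1e-9 and the female numerator on the order of 1e-4, so the person is predicted to be female.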

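Notes

The variance must be computed with ddof=1, which divides by n − 1 and gives the unbiased sample variance. NumPy's default, ddof=0, divides by n and gives the biased estimate.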
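The effect of `ddof` is easy to see on the four male heights. A minimal sketch:

```python
import numpy as np

male_heights = np.array([6, 5.92, 5.58, 5.92])

biased = np.var(male_heights)            # ddof=0 (default): divide by n
unbiased = np.var(male_heights, ddof=1)  # ddof=1: divide by n - 1

print(round(biased, 6), round(unbiased, 6))
```

With only four samples the two estimates differ by a factor of 4/3, which is large enough to change the density values noticeably.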

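Possible improvements

Standardization

The ranges (max minus min) differ greatly between features, which makes the raw values hard to compare. To put the features on a common scale, all values can be standardized as $Y = \frac{X - \mu}{\sigma} \sim N(0, 1)$, so that each feature follows the standard normal distribution.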
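A minimal sketch of this standardization is shown below. It assumes that $\mu$ and $\sigma$ are estimated per feature over all eight samples; the text does not say whether they should instead be estimated per class, so treat this as one possible reading.

```python
import numpy as np

# Columns: height (ft), weight (lb), foot size (in); first four rows male, last four female
X = np.array([
    [6.00, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
    [5.00, 100, 6], [5.50, 150, 8], [5.42, 130, 7], [5.75, 150, 9],
], dtype=float)

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)
Y = (X - mu) / sigma  # z-scores, computed per feature

# A new sample has to be transformed with the same mu and sigma.
query = (np.array([6, 130, 8]) - mu) / sigma

print(np.round(Y.mean(axis=0), 6))         # each entry is (numerically) 0
print(np.round(Y.std(axis=0, ddof=1), 6))  # each entry is 1
```

Note that when the same shift and scale are applied to both classes, every class-conditional density is multiplied by the same factor, so the outcome of the male/female comparison should not change. Under this reading the benefit is mainly that the features become directly comparable in scale.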

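Code

The code expects an Excel file (data.xls) with a header row followed by the data above, where the first column holds the sex written as 男 (male) or 女 (female). The code is as follows: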

import numpy as np
import pandas as pd
import math

DATA_FILENAME = 'data.xls'


def read_excel_data(filename):
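    '''
    Read the contents of the Excel file.
    :param filename: path to the Excel file
    :return: the sheet's data rows as a NumPy array
    '''
    sheet = pd.read_excel(filename)  # reads the first sheet by default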
    data = sheet.values
    return data


def get_math_info(X):
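    '''
    Compute the per-sex statistics.
    :param X: data array; column 0 is the sex label, the remaining columns are the features
    :return: means and variances for each sex, plus the number of male and female samples
    '''
    X_male_row = np.where(X[:, 0] == '男')  # '男' = male
    X_male = np.squeeze(X[X_male_row, 1:]).astype(float)  # cast from object dtype to float
    X_female_row = np.where(X[:, 0] == '女')  # '女' = female
    X_female = np.squeeze(X[X_female_row, 1:]).astype(float)
    mean_male = np.mean(X_male, axis=0)
    var_male = np.var(X_male, axis=0, ddof=1)  # ddof=1: divide by n - 1
    mean_female = np.mean(X_female, axis=0)
    var_female = np.var(X_female, axis=0, ddof=1)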
    print(mean_male, mean_female)
    print(var_male, var_female)
    # for col in range(np.shape(X_male)[1]):
    #     if var_male[col]:
    #         X_male[:, col] = (X_male[:, col] - mean_male[col]) / math.sqrt(var_male[col])
    #
    # for col in range(np.shape(X_female)[1]):
    #     if var_female[col]:
    #         X_female[:, col] = (X_female[:, col] - mean_female[col]) / math.sqrt(var_female[col])

    return mean_male, mean_female, var_male, var_female, X_male_row[0].size, X_female_row[0].size


def normal_probability(x, mean, var):
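    '''
    Evaluate the normal probability density at x.
    :param x: input value
    :param mean: mean of the normal distribution
    :param var: variance of the normal distribution
    :return: density value at x, used as the class-conditional term
    '''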
    return np.exp(- ((x-mean)**2 / (2 * var))) / np.sqrt(2 * np.pi * var)


def calculate_probability(height, weight, foot, mean_arr, var_arr, sex_probability):
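    '''
    Compute the numerator of Bayes' formula.
    :param height: height in feet
    :param weight: weight in pounds
    :param foot: foot size in inches
    :param mean_arr: per-feature means (height, weight, foot) for one sex
    :param var_arr: per-feature variances (height, weight, foot) for one sex
    :param sex_probability: prior probability of that sex
    :return: product of the three densities and the prior
    '''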
    probability = normal_probability(height, mean_arr[0], var_arr[0]) * \
                  normal_probability(weight, mean_arr[1], var_arr[1]) * \
                  normal_probability(foot, mean_arr[2], var_arr[2]) * sex_probability
    return probability


def prediction(X, height, weight, foot):
    mean_male, mean_female, var_male, var_female, num_male, num_female = get_math_info(X)
    pro_male = num_male / (num_male + num_female)
    pro_female = 1 - pro_male
    male_probability = calculate_probability(height, weight, foot, mean_male, var_male, pro_male)
    female_probability = calculate_probability(height, weight, foot, mean_female, var_female, pro_female)
    print("male score:", male_probability, ", female score:", female_probability)
    print(female_probability / male_probability)  # ratio of female to male score
    if male_probability > female_probability:
        print("predicted: male")
    else:
        print("predicted: female")

def main():
    X = read_excel_data(DATA_FILENAME)
    prediction(X, 6, 130, 8)


if __name__ == '__main__':
    main()
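
If no Excel file is at hand, the same array can be built in memory. The sketch below is not the code above but a compact variant of it: it mimics the layout of `sheet.values`, and it compares log scores instead of raw products, which avoids underflow when many small densities are multiplied. The names `log_gaussian` and `predict` are illustrative assumptions.

```python
import numpy as np

# Same layout as sheet.values: sex label, height (ft), weight (lb), foot size (in)
data = np.array([
    ['男', 6.00, 180, 12], ['男', 5.92, 190, 11], ['男', 5.58, 170, 12], ['男', 5.92, 165, 10],
    ['女', 5.00, 100, 6], ['女', 5.50, 150, 8], ['女', 5.42, 130, 7], ['女', 5.75, 150, 9],
], dtype=object)


def log_gaussian(x, mean, var):
    """Log of the normal density, evaluated element-wise."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(data, query):
    """Return the label with the highest log score, plus all scores."""
    labels = data[:, 0]
    features = data[:, 1:].astype(float)
    scores = {}
    for label in set(labels):
        rows = features[labels == label]
        mean = rows.mean(axis=0)
        var = rows.var(axis=0, ddof=1)  # sample variance, as in the table
        log_prior = np.log(len(rows) / len(features))
        scores[label] = log_prior + log_gaussian(query, mean, var).sum()
    return max(scores, key=scores.get), scores


label, scores = predict(data, np.array([6, 130, 8]))
print(label, scores)  # '女' (female) has the higher log score
```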

