Bayes方案和传统机器学习方法

最新推荐文章于 2023-12-14 09:10:55 发布

shanesu

最新推荐文章于 2023-12-14 09:10:55 发布

阅读量1.1k

点赞数

分类专栏：课程笔记机器学习 python

本文链接：https://blog.csdn.net/qq_36080693/article/details/80473634

版权

课程笔记同时被 3 个专栏收录

28 篇文章

订阅专栏

python

12 篇文章

订阅专栏

机器学习

7 篇文章

订阅专栏

本文内容主要总结自coursera课程Bayesian Methods for Machine Learning

一、数据

以身高、体重的线性关系为例

身高	体重
170	77
165	65
155	60
178	85

x：身高 y：体重 y = ax+b

二、差别

1.1、传统机器学习方法

参数a、b是一个确定的值，通过拟合线性方程，我们可以根据数据集得到一个确定的a和b值

得到a和b值后，如果我们得到一个新的身高 190，就可以通过公式计算出体重（1.1367*190-118）

1.2、贝叶斯方法

参数a、b并不是一个确定的值，而是属于某一个概率分布。

为什么是一个概率分布呢，因为贝叶斯认为，线性方程是我们想象出来的东西，他并不是真实存在的。

线性方程中，不同位置的偏置bias不同，如上图65和77，所以，对于不同身高的预测准确性是不一样的
线性方法的参数，是基于有限的数据来源得到的，不一定能代表真实世界的值

如果取分布的均值，作为参数a、b的值，那就和传统机器学习方法一样了。

2.1、传统机器学习方法

结果是明确的：如通过公式计算出体重（1.1367*190-118）

2.2、贝叶斯方法

计算结果一定概率上是准确的，或者可以理解为存在对结果的信心。就比如果在数据集中体重77的点大多数都在拟合线附近，且数据比较多，我们会对体重77的身高预测结果更有信心

三、计算

因为我们需要计算参数a、b的概率分布，一种方法就是MCMC 算法

在这个算法种用到了两个方法

Gibbs Sampling：轮流固定a、b，将一个双分布问题，转化为一个单一分布问题（有点类似协同过滤算法）
Metropolis Hastings：

从分布Q(a1)中按概率分布随机得到一个新的a2值
计算如果采用新的a2，y = ax+b是否比采用a1更符合数据源。
如果符合就保留a2，否则就抛弃a2。
重复1-3的过程，我们就得到了一个系列的a值
根据这个系列的a值，去掉最开始的一部分，我们认为这个系列的a值就是分布Q(a)的样本空间
从而，我们得到的了分布Q(a)的均值和其他参数

四、Python 实现

#加载必要包

import numpy as np
import pandas as pd
import numpy.random as rnd
import seaborn as sns
from matplotlib import animation
import pymc3 as pm
%pylab inline

数据源：其中：income_more_50K：1 =》收入大于50k

	sex	age	educ	hours
0	Male	39	13	40
1	Male	50	13	13
2	Male	38	9	40
3	Male	53	7	40
4	Female	28	13	40

$p(y = 1 \mid \alpha, \beta_1, \beta_2) = \sigma(\alpha + \beta_1 x_1 + \beta_2 x_2) \\ \alpha \sim N(0, 100^2) \\\beta_1 \sim N(0, 100^2) \\\beta_2 \sim N(0, 100^2), \\$

y：income_more_50K x1：age x2：educ

梯度计算

with pm.Model() as manual_logistic_model:
   
    alpha = pm.Normal('alpha', mu=0, sd=100)
    beta_age_coefficient = pm.Normal('beta_age_coefficient', mu=0, sd=100)
    beta_education_coefficient = pm.Normal('beta_education_coefficient', mu=0, sd=100)
    
    # Thansform these random variables into vector of probabilities p(y_i=1) using logistic regression model specified 
    # above. PyMC random variables are theano shared variables and support simple mathematical operations.
    # Use pm.invlogit for the sigmoid function.
    
    z = pm.invlogit(alpha + beta_age_coefficient * np.array(data['age']) + \
                    beta_education_coefficient * np.array(data['educ']))
    
    # Declare PyMC Bernoulli random vector with probability of success equal to the corresponding value
    # given by the sigmoid function.
    # Supply target vector using "observed" argument in the constructor.

    y_obs = pm.Bernoulli('y_obs', p=z, observed=data['income_more_50K'])
    
    # Use pm.find_MAP() to find the maximum a-posteriori estimate for the vector of logistic regression weights.
    map_estimate = pm.find_MAP()
    print(map_estimate)

梯度计算简化版本

with pm.Model() as logistic_model:

    pm.glm.GLM.from_formula('income_more_50K ~ age + educ', data, family=pm.glm.families.Binomial())
    map_estimate = pm.find_MAP()
    print(map_estimate)

MCMC算法

with pm.Model() as logistic_model:

    # Train Bayesian logistic regression model on the following features: sex, age, educ, hours
    # Use pm.sample to run MCMC to train this model.
    # To specify the particular sampler method (Metropolis-Hastings) to pm.sample,
    # use `pm.Metropolis`.
    # Train your model for 400 samples.
    # Save the output of pm.sample to a variable: this is the trace of the sampling procedure and will be used
    # to estimate the statistics of the posterior distribution.
    
    pm.glm.GLM.from_formula('income_more_50K ~ sex + age + educ + hours', data, family=pm.glm.families.Binomial())
    with logistic_model:
        trace = pm.sample(400, step=[pm.Metropolis()])

nuts算法

with pm.Model() as logistic_model:
    # Train Bayesian logistic regression model on the following features: sex, age, educ, hours
    # Use pm.sample to run MCMC to train this model.
    # Train your model for *4000* samples (ten times more than before).

    pm.glm.GLM.from_formula('income_more_50K ~ sex + age + educ + hours', data, family=pm.glm.families.Binomial())
    with logistic_model:
        trace = pm.sample(4000, step=[pm.NUTS()])