Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
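Using the naive conditional independence assumption, $P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$, and noting that $P(x_1, \dots, x_n)$ is constant for a given input, we have

$$\begin{aligned} P(y \mid x_1, \dots, x_n) &\propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \\ \hat{y} &= \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) \end{aligned}$$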
We can then use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is the relative frequency of class $y$ in the training set.
Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification.
The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
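To make the Gaussian likelihood concrete, the sketch below computes the per-class mean and variance by hand with NumPy, plugs them into the density above, and compares the resulting posterior with `GaussianNB.predict_proba`. It reuses the six-point toy data from this section; the query point `x_new` and the variable names are choices made for this illustration. scikit-learn adds a tiny `var_smoothing` term to the variances, so the two results should agree only approximately.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data from this section, cast to float.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])
x_new = np.array([-0.8, -1.0])

classes = np.unique(y)
log_scores = []
for c in classes:
    Xc = X[y == c]
    prior = len(Xc) / len(X)   # P(y): relative class frequency
    mu = Xc.mean(axis=0)       # maximum-likelihood mean, one per feature
    var = Xc.var(axis=0)       # maximum-likelihood variance (divides by N, not N-1)
    # log of the Gaussian density for each feature of x_new
    log_lik = -0.5 * np.log(2 * np.pi * var) - (x_new - mu) ** 2 / (2 * var)
    # log P(y) + sum_i log P(x_i | y)
    log_scores.append(np.log(prior) + log_lik.sum())

log_scores = np.array(log_scores)
# Normalise in log space to obtain P(y | x_new).
posterior = np.exp(log_scores - np.logaddexp.reduce(log_scores))

clf = GaussianNB().fit(X, y)
print("by hand:     ", posterior)
print("scikit-learn:", clf.predict_proba(x_new.reshape(1, -1))[0])
```

Working in log space avoids underflow when many small per-feature likelihoods are multiplied together.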
Example - the training data is generated as follows:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
**Q1:** Train a GaussianNB model:
import numpy as np
from sklearn.naive_bayes import GaussianNB
# Instantiate the classifier
clf = GaussianNB()
# Fit on the training data (fit plays the role of "train")
clf.fit(X, y)
GaussianNB(priors=None, var_smoothing=1e-09)
**Q2:** Predict the class of the data point [-0.8, -1]:
X_test = np.array([[-0.8,-1]])
y_test = np.array([1])
pred = clf.predict(X_test)
print(pred)
[1]
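Multinomial Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (for example, word counts for text classification).
Example - the training data is generated as follows: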
import numpy as np
X = np.random.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
**Q3:** Train a MultinomialNB model:
# Import the required modules
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# Instantiate the classifier
clf = MultinomialNB()
# Fit on the training data (fit plays the role of "train")
clf.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
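**Q4:** Predict the data X[2:3]:
# This cell draws three new random rows instead of using X[2:3] as the question asks,
# so the predicted labels vary from run to run; clf.predict(X[2:3]) answers Q4 literally.
X_test = np.random.randint(2, size=(3, 100))
pred = clf.predict(X_test)
print(pred)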
[1 3 5]
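The word-count use case mentioned for the multinomial model can be sketched as follows. The four-document corpus and its "spam"/"ham" labels are invented purely for illustration; `CountVectorizer` turns each document into a vector of word counts, which is the kind of discrete feature `MultinomialNB` expects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus, invented for this example.
docs = ["win money now", "free money offer",
        "meeting at noon", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)  # sparse matrix of per-document word counts

clf = MultinomialNB()  # default alpha=1.0 applies Laplace smoothing to the counts
clf.fit(X_counts, labels)

# New text must go through the same fitted vectorizer before prediction.
new_doc = vectorizer.transform(["free money"])
print(clf.predict(new_doc))
```

With smoothing, words that never occurred in a class still receive a small non-zero probability, so one unseen word does not rule a class out.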
Working with the Iris data
Use the GaussianNB algorithm for a classification task on the 'iris' data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], random_state=142)
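**Q5:** Report the accuracy on the test data:
model = GaussianNB().fit(X_train, y_train)
pre_y = model.predict(X_test)
# Count the test samples whose predicted label matches the true label
m = 0
for i in range(len(pre_y)):
    if pre_y[i] == y_test[i]:
        m = m + 1
print("Accuracy:", m * 1.0 / len(pre_y))
Accuracy: 0.8947368421052632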
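The manual counting loop is equivalent to the accuracy helpers that scikit-learn already provides. The sketch below rebuilds the same split and checks that a hand count, `accuracy_score` and the estimator's own `score` method agree; the variable names `pred` and `manual` are chosen for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=142)

model = GaussianNB().fit(X_train, y_train)
pred = model.predict(X_test)

# Fraction of test samples whose prediction matches the true label.
manual = sum(int(p == t) for p, t in zip(pred, y_test)) / len(y_test)

print("manual count:  ", manual)
print("accuracy_score:", accuracy_score(y_test, pred))
print("model.score:   ", model.score(X_test, y_test))
```

All three numbers are the same quantity: correct predictions divided by the number of test samples, which is 38 here because `train_test_split` holds out 25% of the 150 rows by default.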
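Predicting Human Activity Recognition (HAR)
The goal of this exercise is to predict the current human activity from the physiological activity measurements in the HAR dataset. The data has 53 columns: the activity label classe plus 52 measurement features. A training set (har_train.csv) and a test set (har_validate.csv) are provided.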
**Q6:** Build a Naive Bayes model, make predictions on the test dataset, and compute the confusion matrix. Note: see sklearn.metrics.confusion_matrix
import pandas as pd
from sklearn.metrics import confusion_matrix
train=pd.read_csv('C:/Users/QYH123/Python/数据挖掘/har_train.csv',parse_dates=True)
test=pd.read_csv('C:/Users/QYH123/Python/数据挖掘//har_validate.csv',parse_dates=True)
train.head()
| | classe | roll_belt | pitch_belt | yaw_belt | total_accel_belt | gyros_belt_x | gyros_belt_y | gyros_belt_z | accel_belt_x | accel_belt_y | ... | total_accel_forearm | gyros_forearm_x | gyros_forearm_y | gyros_forearm_z | accel_forearm_x | accel_forearm_y | accel_forearm_z | magnet_forearm_x | magnet_forearm_y | magnet_forearm_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 1.41 | 8.07 | -94.4 | 3 | 0.00 | 0.0 | -0.02 | -21 | 4 | ... | 36 | 0.03 | 0.00 | -0.02 | 192 | 203 | -215 | -17 | 654 | 476 |
| 1 | A | 1.41 | 8.07 | -94.4 | 3 | 0.02 | 0.0 | -0.02 | -22 | 4 | ... | 36 | 0.02 | 0.00 | -0.02 | 192 | 203 | -216 | -18 | 661 | 473 |
| 2 | A | 1.42 | 8.07 | -94.4 | 3 | 0.00 | 0.0 | -0.02 | -20 | 5 | ... | 36 | 0.03 | -0.02 | 0.00 | 196 | 204 | -213 | -18 | 658 | 469 |
| 3 | A | 1.48 | 8.05 | -94.4 | 3 | 0.02 | 0.0 | -0.03 | -22 | 3 | ... | 36 | 0.02 | -0.02 | 0.00 | 189 | 206 | -214 | -16 | 658 | 469 |
| 4 | A | 1.45 | 8.06 | -94.4 | 3 | 0.02 | 0.0 | -0.02 | -21 | 4 | ... | 36 | 0.02 | -0.02 | -0.03 | 193 | 203 | -215 | -9 | 660 | 478 |
5 rows × 53 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13737 entries, 0 to 13736
Data columns (total 53 columns):
classe 13737 non-null object
roll_belt 13737 non-null float64
pitch_belt 13737 non-null float64
yaw_belt 13737 non-null float64
total_accel_belt 13737 non-null int64
gyros_belt_x 13737 non-null float64
gyros_belt_y 13737 non-null float64
gyros_belt_z 13737 non-null float64
accel_belt_x 13737 non-null int64
accel_belt_y 13737 non-null int64
accel_belt_z 13737 non-null int64
magnet_belt_x 13737 non-null int64
magnet_belt_y 13737 non-null int64
magnet_belt_z 13737 non-null int64
roll_arm 13737 non-null float64
pitch_arm 13737 non-null float64
yaw_arm 13737 non-null float64
total_accel_arm 13737 non-null int64
gyros_arm_x 13737 non-null float64
gyros_arm_y 13737 non-null float64
gyros_arm_z 13737 non-null float64
accel_arm_x 13737 non-null int64
accel_arm_y 13737 non-null int64
accel_arm_z 13737 non-null int64
magnet_arm_x 13737 non-null int64
magnet_arm_y 13737 non-null int64
magnet_arm_z 13737 non-null int64
roll_dumbbell 13737 non-null float64
pitch_dumbbell 13737 non-null float64
yaw_dumbbell 13737 non-null float64
total_accel_dumbbell 13737 non-null int64
gyros_dumbbell_x 13737 non-null float64
gyros_dumbbell_y 13737 non-null float64
gyros_dumbbell_z 13737 non-null float64
accel_dumbbell_x 13737 non-null int64
accel_dumbbell_y 13737 non-null int64
accel_dumbbell_z 13737 non-null int64
magnet_dumbbell_x 13737 non-null int64
magnet_dumbbell_y 13737 non-null int64
magnet_dumbbell_z 13737 non-null int64
roll_forearm 13737 non-null float64
pitch_forearm 13737 non-null float64
yaw_forearm 13737 non-null float64
total_accel_forearm 13737 non-null int64
gyros_forearm_x 13737 non-null float64
gyros_forearm_y 13737 non-null float64
gyros_forearm_z 13737 non-null float64
accel_forearm_x 13737 non-null int64
accel_forearm_y 13737 non-null int64
accel_forearm_z 13737 non-null int64
magnet_forearm_x 13737 non-null int64
magnet_forearm_y 13737 non-null int64
magnet_forearm_z 13737 non-null int64
dtypes: float64(24), int64(28), object(1)
memory usage: 5.6+ MB
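from sklearn import metrics
from sklearn.naive_bayes import BernoulliNB
# Split each HAR frame into features and the 'classe' label. Reusing the
# X_train/y_train/X_test/y_test variables left over from the iris split
# would fit and score the iris data instead of HAR.
X_train = train.loc[:, train.columns != 'classe']
y_train = train['classe']
X_test = test.loc[:, test.columns != 'classe']
y_test = test['classe']
model = BernoulliNB()
model.fit(X_train, y_train)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
y_pred = model.predict(X_test)
print("model accuracy:", metrics.accuracy_score(y_test, y_pred))
# Q6 also asks for the confusion matrix (rows: true classes, columns: predicted classes)
print(confusion_matrix(y_test, y_pred))
Note that an accuracy of 0.18421052631578946 (exactly 7/38) is what this cell prints when X_train, y_train, X_test and y_test are still the iris split from the previous section, whose test set has 38 samples; it is not a HAR result.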
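BernoulliNB binarizes every feature at `binarize=0.0` before fitting, so for continuous sensor readings it keeps only the sign of each value, whereas GaussianNB models the values themselves. The sketch below compares the two and prints a confusion matrix for each. It runs on synthetic data from `make_classification` as a stand-in, not on the HAR files; the sample count, the feature count and the choice of five classes are arbitrary assumptions for this illustration, and which model scores higher on the real data is not something this sketch can show.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in for the HAR data (NOT the real files): continuous features, 5 classes.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for name, model in [("BernoulliNB", BernoulliNB()), ("GaussianNB", GaussianNB())]:
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_test, y_pred)
    results[name] = (accuracy_score(y_test, y_pred), cm)
    print(name, "accuracy:", results[name][0])
    print(cm)
```

The diagonal of each matrix counts correct predictions, so its sum divided by the total number of test samples reproduces the accuracy.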