Lecture notes: C1W2. Sentiment Analysis with Naïve Bayes
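Import packages
In the following exercise, you will visually inspect a tweet dataset using Naïve Bayes features. The key point to understand is that the log-likelihood ratio yields a pair of numerical features that can be fed into a machine learning algorithm.
Finally, the concept of the confidence ellipse is introduced as a tool for visualizing a Naïve Bayes model.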
import numpy as np # Library for linear algebra and math utils
import pandas as pd # Dataframe library
import matplotlib.pyplot as plt # Library for plots
from utils import confidence_ellipse # Function to add confidence ellipses to charts
Calculate the likelihoods for each tweet
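For each tweet, we have computed its likelihood of being positive and its likelihood of being negative. The numerator and denominator of the likelihood ratio are given below.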
$$\log \frac{P(tweet \mid pos)}{P(tweet \mid neg)} = \log(P(tweet \mid pos)) - \log(P(tweet \mid neg))$$
$$positive = \log(P(tweet \mid pos)) = \sum_{i=0}^{n} \log P(W_i \mid pos)$$
$$negative = \log(P(tweet \mid neg)) = \sum_{i=0}^{n} \log P(W_i \mid neg)$$
The code that implements the formulas above is not required for this lab; its output is stored in the file 'bayes_features.csv'.
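For intuition only, the two sums can be sketched in a few lines. The word probabilities below are made-up illustrative values and `tweet_log_likelihoods` is a hypothetical helper; this is not the code that produced 'bayes_features.csv'.

```python
import math

# Hypothetical conditional word probabilities P(W_i | class); illustrative values only.
p_word_given_pos = {"happy": 0.20, "sad": 0.05, "day": 0.10}
p_word_given_neg = {"happy": 0.05, "sad": 0.20, "day": 0.10}

def tweet_log_likelihoods(words):
    """Return (positive, negative): the sums of log P(W_i | class) over the words."""
    positive = sum(math.log(p_word_given_pos[w]) for w in words)
    negative = sum(math.log(p_word_given_neg[w]) for w in words)
    return positive, negative

positive, negative = tweet_log_likelihoods(["happy", "day"])
log_ratio = positive - negative  # equals log(P(tweet|pos) / P(tweet|neg))
print(positive, negative, log_ratio)
```

Each tweet is thus reduced to the pair (positive, negative), which is what each row of the CSV file is described as holding.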
data = pd.read_csv('./data/bayes_features.csv') # Load the data from the csv file
data.head(5) # Show the features of the first 5 tweets. Each row represents a tweet
Result:
Plot:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8)) #Create a new figure with a custom size
colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']
index = data.index
# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
ax.legend(loc='best')
# Custom limits for this chart
plt.xlim(-250,0)
plt.ylim(-250,0)
plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label
plt.show()
Using Confidence Ellipses to interpret Naïve Bayes
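In this section we use confidence ellipses to analyze the results of Naïve Bayes.
A confidence ellipse is a way of visualizing a two-dimensional random variable. It works better than plotting individual points on the Cartesian plane because, on large datasets, the points overlap heavily and hide the true distribution of the data. A confidence ellipse summarizes the dataset with only four parameters: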
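- Center: the mean value of each attribute.
- Height and width: related to the variance of each attribute. The user must specify how many standard deviations the ellipse should span.
- Angle: related to the covariance between the attributes.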
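The relationship between these parameters and the data can be sketched with NumPy. This is one common construction based on the eigendecomposition of the covariance matrix, not necessarily what `utils.confidence_ellipse` does internally; `ellipse_params` is a hypothetical helper and the sample data is synthetic.

```python
import numpy as np

def ellipse_params(x, y, n_std=2.0):
    """Hypothetical helper: center, major axis, minor axis, and angle (degrees)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    center = (x.mean(), y.mean())           # mean of each attribute
    cov = np.cov(x, y)                      # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    # Full axis lengths: 2 * n_std * standard deviation along each principal axis
    minor, major = 2.0 * n_std * np.sqrt(eigvals)
    # Orientation of the major axis, taken from the eigenvector of the largest eigenvalue
    vx, vy = eigvecs[:, 1]
    angle = np.degrees(np.arctan2(vy, vx))
    return center, major, minor, angle

# Synthetic, positively correlated data for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)
center, major, minor, angle = ellipse_params(x, y, n_std=2)
print(center, major, minor, angle)
```

With zero covariance the principal axes line up with the coordinate axes; a nonzero covariance tilts the ellipse, which is why the angle is tied to the covariance.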
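The parameter n_std is the number of standard deviations at which the ellipse boundary is drawn. Recall that for a normally distributed random variable:
- About 68% of the area under the curve lies within 1 standard deviation of the mean.
- About 95% of the area under the curve lies within 2 standard deviations of the mean.
- About 99.7% of the area under the curve lies within 3 standard deviations of the mean.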
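These percentages can be checked with the error function from the standard library. The sketch below covers the one-dimensional normal distribution only, and `coverage_1d` is a hypothetical name chosen for illustration.

```python
import math

def coverage_1d(n_std):
    """Fraction of a 1-D normal distribution within n_std standard deviations of the mean."""
    return math.erf(n_std / math.sqrt(2))

for n_std in (1, 2, 3):
    print(f"{n_std} std: {coverage_1d(n_std):.4f}")
```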
The code below draws the ellipses for n_std=2 and n_std=3:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))
colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']
index = data.index
# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
# Custom limits for this chart
plt.xlim(-200,40)
plt.ylim(-200,40)
plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label
data_pos = data[data.sentiment == 1] # Filter only the positive samples
data_neg = data[data.sentiment == 0] # Filter only the negative samples
# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')
# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')
plt.show()
Next, modify the features of the positive tweets so that they overlap with the negative ones:
data2 = data.copy() # Copy the whole data frame
# The assignments below only modify the rows of the data frame where sentiment == 1
# Earlier form using chained assignment, kept for reference:
#data2.negative[data.sentiment == 1] = data2.negative * 1.5 + 50 # Modify the negative attribute
#data2.positive[data.sentiment == 1] = data2.positive / 1.5 - 50 # Modify the positive attribute
# For rows with sentiment == 1, modify the negative attribute
data2.loc[data2.sentiment == 1, 'negative'] = data2.loc[data2.sentiment == 1, 'negative'] * 1.5 + 50
# For rows with sentiment == 1, modify the positive attribute
data2.loc[data2.sentiment == 1, 'positive'] = data2.loc[data2.sentiment == 1, 'positive'] / 1.5 - 50
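The effect of the masked `.loc` assignment is easier to see on a tiny frame. The values below are made up for illustration; only the column names and the two transformations are taken from the code above.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the lab data
toy = pd.DataFrame({
    "positive": [-10.0, -20.0, -30.0],
    "negative": [-12.0, -18.0, -40.0],
    "sentiment": [1.0, 0.0, 1.0],
})

toy2 = toy.copy()            # Work on a copy so the original stays untouched
mask = toy2.sentiment == 1   # Boolean mask selecting the positive rows
toy2.loc[mask, "negative"] = toy2.loc[mask, "negative"] * 1.5 + 50
toy2.loc[mask, "positive"] = toy2.loc[mask, "positive"] / 1.5 - 50
print(toy2)
```

Only the rows where sentiment == 1 change; the row with sentiment == 0 and the original frame keep their values.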
Redraw the plot:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))
colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive']
index = data2.index
# Color based on sentiment
for sentiment in data2.sentiment.unique():
    ix = index[data2.sentiment == sentiment]
    ax.scatter(data2.iloc[ix].positive, data2.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])
#ax.scatter(data2.positive, data2.negative, c=[colors[int(k)] for k in data2.sentiment], s = 0.1, marker='*') # Plot a dot for tweet
# Custom limits for this chart
plt.xlim(-200,40)
plt.ylim(-200,40)
plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label
data_pos = data2[data2.sentiment == 1] # Filter only the positive samples
data_neg = data2[data2.sentiment == 0] # Filter only the negative samples
# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')
# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')
plt.show()
After the modification, the two distributions begin to overlap.