C1W2.LAB.Visualizing Naive Bayes

oldmao_2000

Article link: https://blog.csdn.net/oldmao_2001/article/details/140428962

Lecture: C1W2.Sentiment Analysis with Naïve Bayes

Contents

Importing packages
Calculate the likelihoods for each tweet
Using Confidence Ellipses to interpret Naïve Bayes

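Importing packages

In this exercise we visually inspect a dataset of tweets through its Naïve Bayes features. The main goal is to see the log-likelihood ratio as a pair of numerical features that can be fed into a machine learning algorithm.

At the end, we introduce the confidence ellipse as a tool for representing the Naïve Bayes model visually.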

import numpy as np # Library for linear algebra and math utils
import pandas as pd # Dataframe library

import matplotlib.pyplot as plt # Library for plots
from utils import confidence_ellipse # Function to add confidence ellipses to charts

Calculate the likelihoods for each tweet

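For each tweet, a positive likelihood and a negative likelihood have already been computed. They are the numerator and the denominator of the likelihood ratio, and in log space the ratio becomes a difference:
$\log \frac{P(tweet|pos)}{P(tweet|neg)} = \log P(tweet|pos) - \log P(tweet|neg)$
$positive = \log P(tweet|pos) = \sum_{i=0}^{n}{\log P(W_i|pos)}$
$negative = \log P(tweet|neg) = \sum_{i=0}^{n}{\log P(W_i|neg)}$
The code for these formulas is not part of this lab; its output is stored in the file 'bayes_features.csv'.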
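Although the lab does not require this code, the two sums are easy to sketch. The example below is only an illustration: the word probabilities are hypothetical values made up for the example, and the names `log_likelihood` and `bayes_features` are assumptions, not functions from the lab.

```python
import math

# Hypothetical smoothed word probabilities; not taken from the lab's data.
P_WORD_GIVEN_POS = {"happy": 0.20, "sad": 0.02, "day": 0.10}
P_WORD_GIVEN_NEG = {"happy": 0.03, "sad": 0.25, "day": 0.10}


def log_likelihood(words, word_probs):
    """Sum log P(w | class) over the words of a tweet, skipping unknown words."""
    return sum(math.log(word_probs[w]) for w in words if w in word_probs)


def bayes_features(words):
    """Return the (positive, negative) pair of log-likelihood features."""
    positive = log_likelihood(words, P_WORD_GIVEN_POS)
    negative = log_likelihood(words, P_WORD_GIVEN_NEG)
    return positive, negative


positive, negative = bayes_features(["happy", "day"])
log_ratio = positive - negative  # log of P(tweet|pos) / P(tweet|neg)
print(positive, negative, log_ratio)
```

Because every probability is at most 1, both features are at most 0, which is consistent with the negative axis ranges used in the plots of this lab.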

data = pd.read_csv('./data/bayes_features.csv'); # Load the data from the csv file

data.head(5) # Print the first 5 tweets features. Each row represents a tweet

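Each row of the resulting table is one tweet; the columns used below are `positive`, `negative` and `sentiment`.
Next, plot the two features as a scatter plot: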

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8)) #Create a new figure with a custom size

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive'] 

index = data.index

# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

ax.legend(loc='best')    
    
# Custom limits for this chart
plt.xlim(-250,0)
plt.ylim(-250,0)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label
plt.show()

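Since the two axes are the log-likelihoods of each class, a tweet can be classified by comparing them. The sketch below assumes the log prior is zero (balanced classes), which the article does not state, and it uses a few hypothetical rows rather than the real contents of 'bayes_features.csv'. The name `predict_sentiment` is also an assumption made for this example.

```python
def predict_sentiment(positive, negative):
    """Return 1 (positive) if the log-likelihood ratio is above 0, else 0."""
    return 1 if positive - negative > 0 else 0


# Hypothetical (positive, negative, sentiment) rows, for illustration only
rows = [
    (-45.8, -63.4, 1),
    (-53.3, -52.1, 0),
    (-20.0, -31.5, 1),
    (-98.2, -90.7, 0),
]

predictions = [predict_sentiment(p, n) for p, n, _ in rows]
labels = [label for _, _, label in rows]
accuracy = sum(pred == label for pred, label in zip(predictions, labels)) / len(rows)
print(predictions, accuracy)
```

In the scatter plot, with `positive` on the x-axis and `negative` on the y-axis, this rule corresponds to the diagonal line positive = negative: points below the diagonal are classified as positive and points above it as negative.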

Using Confidence Ellipses to interpret Naïve Bayes

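In this section we use confidence ellipses to analyze the results of Naïve Bayes.

A confidence ellipse is a way to visualize a 2D random variable. It is better than plotting the points on a Cartesian plane because, on large datasets, the points overlap heavily and hide the real distribution of the data. A confidence ellipse summarizes the dataset with only four parameters:

Center: the numerical mean of the attributes.
Height and width: related to the variance of each attribute. The user must specify the number of standard deviations used to draw the ellipse.
Angle: related to the covariance between the attributes.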
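To make the four parameters concrete, here is a minimal sketch that derives them from the sample covariance matrix of two attributes. It is not the implementation of `confidence_ellipse` in `utils` (that code is not shown in this article); the function name `ellipse_parameters` and the toy data are assumptions for illustration.

```python
import math


def ellipse_parameters(xs, ys, n_std=2.0):
    """Return (center, width, height, angle_degrees) of the covariance ellipse."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample variances and covariance (divide by n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Eigenvalues of the 2x2 covariance matrix, in closed form
    half_trace = (var_x + var_y) / 2
    root = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    major_var = half_trace + root
    minor_var = max(half_trace - root, 0.0)  # clamp tiny negative rounding errors

    width = 2 * n_std * math.sqrt(major_var)   # full length of the major axis
    height = 2 * n_std * math.sqrt(minor_var)  # full length of the minor axis
    # Orientation of the major axis, driven by the covariance term
    angle = math.degrees(0.5 * math.atan2(2 * cov_xy, var_x - var_y))
    return (mean_x, mean_y), width, height, angle


# Toy data lying exactly on the line y = x
center, width, height, angle = ellipse_parameters([0, 1, 2, 3], [0, 1, 2, 3], n_std=2)
print(center, width, height, angle)
```

For the perfectly correlated toy data the ellipse degenerates to a segment: the center is the mean point, the major axis is tilted by 45 degrees, and the minor axis has (numerically) zero length.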

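The parameter n_std is the number of standard deviations at the boundary of the ellipse. Remember that for a normal distribution:

About 68% of the area under the curve falls within 1 standard deviation of the mean.
About 95% of the area under the curve falls within 2 standard deviations of the mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the mean.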
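These percentages can be checked with the error function from the standard library. The sketch also prints the corresponding mass for a bivariate normal distribution, under the assumption that the ellipse is the contour at Mahalanobis distance k from the mean; whether `confidence_ellipse` from `utils` draws exactly that contour is not confirmed by this article. The function names are made up for this example.

```python
import math


def mass_within_1d(k):
    """P(|Z| <= k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))


def mass_within_2d(k):
    """P(Mahalanobis distance <= k) for a bivariate normal distribution."""
    return 1 - math.exp(-k ** 2 / 2)


for k in (1, 2, 3):
    print(k, round(mass_within_1d(k), 4), round(mass_within_2d(k), 4))
```

Under that assumption, an ellipse in two dimensions encloses less probability mass than the one-dimensional rule suggests for the same number of standard deviations, so the 68%, 95% and 99.7% figures are best read as a reminder about each attribute taken separately.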

The code below draws the ellipses for 2 and 3 standard deviations (n_std=2 and n_std=3):

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive'] 
index = data.index

# Color based on sentiment
for sentiment in data.sentiment.unique():
    ix = index[data.sentiment == sentiment]
    ax.scatter(data.iloc[ix].positive, data.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

# Custom limits for this chart
plt.xlim(-200,40)  
plt.ylim(-200,40)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

data_pos = data[data.sentiment == 1] # Filter only the positive samples
data_neg = data[data.sentiment == 0] # Filter only the negative samples

# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')

# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')

plt.show()

Next, modify the features of the positive tweets so that they overlap with the negative ones:

data2 = data.copy() # Copy the whole data frame

# Only the entries of the data frame where sentiment == 1 are modified.
# Chained-assignment form, left commented out:
#data2.negative[data.sentiment == 1] =  data2.negative * 1.5 + 50 # Modify the negative attribute
#data2.positive[data.sentiment == 1] =  data2.positive / 1.5 - 50 # Modify the positive attribute 
is_positive = data2.sentiment == 1 # Boolean mask selecting the positive tweets

# For the tweets with sentiment == 1, modify the negative attribute
data2.loc[is_positive, 'negative'] = data2.loc[is_positive, 'negative'] * 1.5 + 50

# For the tweets with sentiment == 1, modify the positive attribute
data2.loc[is_positive, 'positive'] = data2.loc[is_positive, 'positive'] / 1.5 - 50
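Both modifications are affine maps of the form a * x + b, so their effect on the ellipse is predictable: the mean moves to a * mean + b and the standard deviation is multiplied by |a|. The sketch below checks this on a few hypothetical feature values (they are not taken from the dataset) using the rule applied to the positive attribute.

```python
import statistics

# Hypothetical 'positive' feature values for a few positive tweets
positive_before = [-40.0, -55.0, -70.0, -85.0]
positive_after = [x / 1.5 - 50 for x in positive_before]  # same rule as in the lab

mean_before = statistics.mean(positive_before)
mean_after = statistics.mean(positive_after)
std_before = statistics.stdev(positive_before)
std_after = statistics.stdev(positive_after)

print(mean_before, mean_after)  # the center of the ellipse shifts
print(std_before, std_after)    # the spread along this axis shrinks by 1.5
```

For the negative attribute the direction is reversed: multiplying by 1.5 widens the spread, and adding 50 moves the center towards larger values.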

Redraw the plot:

# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))

colors = ['red', 'green'] # Define a color palette
sentiments = ['negative', 'positive'] 
index = data2.index

# Color based on sentiment
for sentiment in data2.sentiment.unique():
    ix = index[data2.sentiment == sentiment]
    ax.scatter(data2.iloc[ix].positive, data2.iloc[ix].negative, c=colors[int(sentiment)], s=0.1, marker='*', label=sentiments[int(sentiment)])

#ax.scatter(data2.positive, data2.negative, c=[colors[int(k)] for k in data2.sentiment], s = 0.1, marker='*')  # Plot a dot for tweet
# Custom limits for this chart
plt.xlim(-200,40)  
plt.ylim(-200,40)

plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

data_pos = data2[data2.sentiment == 1] # Filter only the positive samples
data_neg = data2[data2.sentiment == 0] # Filter only the negative samples

# Print confidence ellipses of 2 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$' )
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')

# Print confidence ellipses of 3 std
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend(loc='lower right')

plt.show()

After the modification, the distributions of the two classes start to overlap.
