I. Overview
Naive Bayes assumes that the features are mutually independent. Given a sample, it computes the probability of each class conditioned on that sample; the class with the highest probability is the predicted class.
II. Procedure
1. Compute the prior probabilities
Prior probability: an estimate of the probability of an event before any observed data are taken into account. It is derived from past experience, domain knowledge, or other prior information.
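As a minimal sketch of this step (the label values are hypothetical), the prior of each class can be estimated as its relative frequency among the training labels:

```python
from collections import Counter

# Hypothetical training labels: 0 = ham, 1 = spam
labels = [0, 0, 0, 1, 1]

# Prior of each class = its count / total number of samples
counts = Counter(labels)
priors = {label: count / len(labels) for label, count in counts.items()}
print(priors)  # {0: 0.6, 1: 0.4}
```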
2. Compute the conditional probabilities
Conditional probability: the probability that one event occurs given that another event has already occurred. It is usually written P(A|B), which denotes the probability of event A occurring given that event B has occurred.
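As a toy illustration (the counts are made up), a conditional probability is just a ratio of counts restricted to the conditioning event:

```python
# Hypothetical counts: out of 40 spam emails, 30 contain the word "free"
n_spam = 40
n_spam_with_free = 30

# P(contains "free" | spam) = count(spam containing "free") / count(spam)
p_free_given_spam = n_spam_with_free / n_spam
print(p_free_given_spam)  # 0.75
```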
3. Apply Bayes' theorem
Bayes' formula: P(c|x) = P(x|c) P(c) / P(x), where P(c) is the prior probability of class c, P(x|c) is the class-conditional probability (likelihood) of sample x, and P(x) is the evidence, which is the same for every class and can therefore be ignored when comparing classes.
Laplace smoothing:
When probabilities are estimated from a finite data set, some events may never appear in the sample, so their estimated probability comes out as zero.
Laplace smoothing adds a small constant (typically 1) to every event count so that each event receives a non-zero probability estimate: P(x_i|c) = (count(x_i, c) + 1) / (count(c) + |V|), where |V| is the number of distinct events (here, the vocabulary size).
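A minimal sketch of the correction (the words and counts are hypothetical): without smoothing, a word never seen in a class would get probability zero; adding 1 to every count and the vocabulary size to the denominator fixes this:

```python
from collections import Counter

# Hypothetical word counts observed in one class
word_counts = Counter({"free": 3, "offer": 1})
vocab = ["free", "offer", "meeting"]   # "meeting" was never seen in this class

total = sum(word_counts.values())      # 4 words observed in the class
V = len(vocab)                         # vocabulary size

# Laplace smoothing: +1 to every count, +V to the denominator
smoothed = {w: (word_counts[w] + 1) / (total + V) for w in vocab}
print(smoothed)  # "meeting" gets 1/7 instead of 0, and the values still sum to 1
```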
Avoiding numerical underflow:
Naive Bayes frequently multiplies many probabilities together, and the product of many small probabilities can underflow to zero. Working with log-probabilities turns the product into a sum, which avoids the underflow problem.
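A quick demonstration of the underflow and of the log fix (the probability values are artificial):

```python
import math

# Multiplying tiny probabilities underflows to 0.0 in double precision
p = 1e-200
product = p * p
print(product)  # 0.0 -- the true value 1e-400 is below the double range

# Summing log-probabilities keeps the result representable
log_sum = math.log(p) + math.log(p)
print(log_sum)  # about -921.03
```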
4. Class decision
Compute the probability of each class given the sample; the class with the highest probability is the predicted class.
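The decision rule itself is just an argmax over the per-class scores; a sketch with hypothetical log-scores:

```python
# Hypothetical log-posterior scores for one sample
scores = {"ham": -35.2, "spam": -30.8}

# Pick the class whose score is largest
predicted = max(scores, key=scores.get)
print(predicted)  # spam
```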
III. Spam classification with Naive Bayes
The email directory contains two classes of emails, ham and spam,
with 25 emails in each class.
One of the normal (ham) emails:
import os
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split

# Read the normal emails from the ham folder
ham_emails = []
ham_path = 'email/ham'
for filename in os.listdir(ham_path):
    with open(os.path.join(ham_path, filename), 'rb') as file:
        email_content = file.read().decode('ISO-8859-1')
        ham_emails.append(email_content)

# Read the spam emails from the spam folder
spam_emails = []
spam_path = 'email/spam'
for filename in os.listdir(spam_path):
    with open(os.path.join(spam_path, filename), 'rb') as file:
        email_content = file.read().decode('ISO-8859-1')
        spam_emails.append(email_content)

# Build the vocabulary
all_words = ' '.join(ham_emails + spam_emails).split()
word_dict = Counter(all_words)

# Compute each email's word-frequency vector over the vocabulary
def calculate_word_frequency(emails):
    word_frequency = []
    for email in emails:
        email_words = email.split()
        email_dict = Counter(email_words)
        email_frequency = [email_dict[word] for word in word_dict]
        word_frequency.append(email_frequency)
    return word_frequency

# Split the data into training and test sets
X_ham = calculate_word_frequency(ham_emails)
X_spam = calculate_word_frequency(spam_emails)
X_train = np.array(X_ham + X_spam)
y_train = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Compute the prior probability of each class
def calculate_prior_probabilities(y_train):
    prior_probabilities = {}
    total_count = len(y_train)
    for label in set(y_train):
        prior_probabilities[label] = sum(y_train == label) / total_count
    return prior_probabilities

# Compute the conditional probability of each word given each class
def calculate_conditional_probabilities(X_train, y_train):
    conditional_probabilities = {}
    class_counts = Counter(y_train)
    for label in set(y_train):
        class_total = class_counts[label]
        class_features = X_train[y_train == label]
        # Laplace smoothing: +1 per word, vocabulary size added to the denominator
        conditional_probabilities[label] = (np.sum(class_features, axis=0) + 1) / (np.sum(class_features) + len(word_dict))
    return conditional_probabilities

prior_probabilities = calculate_prior_probabilities(y_train)
conditional_probabilities = calculate_conditional_probabilities(X_train, y_train)

# Predict: compare the log-posterior score of each class and take the larger one
def predict(X_test, prior_probabilities, conditional_probabilities):
    predictions = []
    for email in X_test:
        ham_score = np.sum(np.log(conditional_probabilities[0]) * email) + np.log(prior_probabilities[0])
        spam_score = np.sum(np.log(conditional_probabilities[1]) * email) + np.log(prior_probabilities[1])
        if ham_score > spam_score:
            predictions.append(0)
        else:
            predictions.append(1)
    return predictions

y_pred = predict(X_test, prior_probabilities, conditional_probabilities)
print(y_test)
print(y_pred)

# Evaluate the model
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
Run result:
Summary:
Advantages:
1. Simple and efficient: Naive Bayes is easy to implement and understand, computes quickly, and scales to large data sets.
2. Handles multi-class problems: Naive Bayes supports multi-class classification naturally and is fairly robust to noise in the input data.
Disadvantages:
1. Conditional-independence assumption: Naive Bayes assumes all features are mutually independent, which often does not hold in real applications and can degrade classification performance.
2. Skewed data: when the class distribution in the training data is imbalanced, classification performance may suffer.
3. No feature interactions: because the features are assumed independent, the model cannot learn interactions between features, which can also hurt performance in some cases.