Machine Learning: Naive Bayes

鱼不吃鱼

Article link: https://blog.csdn.net/m0_63715536/article/details/134268662

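This article introduces the basic definition of the naive Bayes method, which simplifies Bayesian classification by assuming that the features are independent, and which suits small datasets and multi-class problems. It then explains the principle of the algorithm, including the prior and posterior probabilities, and gives a Python implementation example.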

1. Definition

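The naive Bayes method simplifies the Bayesian classifier by assuming that the attributes are conditionally independent of one another given the target value. In other words, once the class is known, the value of one attribute tells us nothing further about the value of any other. This simplification can reduce classification accuracy to some degree, but in practical applications it greatly reduces the complexity of the Bayesian approach.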

Advantages: remains effective when little data is available, and can handle multi-class problems.

Disadvantage: fairly sensitive to how the input data is prepared.

2. Algorithm Principle

Bayes' formula:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

Prior probabilities P(X), P(Y): a prior probability is a probability obtained from past experience and analysis.

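Posterior probability P(Y|X): the event has already occurred, and we ask how likely it is that it was caused by a particular factor. P(Y|X) is the probability that Y occurs given that X has already occurred, i.e. the conditional probability of Y given X.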

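Likelihood P(X|Y): the conditional probability of X given that Y has occurred. Because it is derived from the known value of Y, it is sometimes also called the posterior probability of X; viewed as a function of Y, it is the likelihood function. The ratio P(X|Y)/P(X) is usually called the adjustment factor.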
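As a quick numeric illustration of how the prior, the likelihood, and P(X) combine, the sketch below applies Bayes' formula to a made-up spam example. All numbers and the function name `posterior` are assumptions chosen for illustration; none of them come from this article.

```python
def posterior(prior_y, likelihood_x_given_y, evidence_x):
    """Bayes' formula: P(Y|X) = P(X|Y) * P(Y) / P(X)."""
    return likelihood_x_given_y * prior_y / evidence_x


# Hypothetical numbers: Y = "email is spam", X = "email contains the word 'free'"
p_spam = 0.2               # prior P(Y)
p_free_given_spam = 0.5    # likelihood P(X|Y)
p_free_given_ham = 0.05    # P(X|not Y)

# P(X) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

With these assumed values the adjustment factor P(X|Y)/P(X) is greater than 1, so observing X raises the probability of Y above its prior.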

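Naive: the naive Bayes algorithm assumes that the individual features are mutually independent given the class. Real-world data rarely satisfies this exactly, which is where the word "naive" comes from. Under this assumption, with X = (x_1, ..., x_n), the term P(X|Y) in Bayes' formula can be written as:

$$P(X \mid Y) = P(x_1, x_2, \ldots, x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$$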

Naive Bayes formula:

$$P(Y \mid X) = \frac{P(Y)\,\prod_{i=1}^{n} P(x_i \mid Y)}{P(X)}$$
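To make the factorization concrete, the following sketch multiplies the per-feature conditional probabilities by the class prior and then normalizes over the classes, which plays the role of dividing by P(X). The probability tables and the names `priors`, `cond_prob`, and `naive_bayes_posterior` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical class priors P(Y)
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-feature conditionals P(x_i | Y)
cond_prob = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham": {"free": 0.02, "meeting": 0.25},
}


def naive_bayes_posterior(features, priors, cond_prob):
    """Return P(Y | x_1..x_n) for every class under the naive assumption."""
    # Unnormalized score per class: P(Y) * prod_i P(x_i | Y)
    joint = {
        label: priors[label] * math.prod(cond_prob[label][f] for f in features)
        for label in priors
    }
    evidence = sum(joint.values())  # P(X), summed over all classes
    return {label: score / evidence for label, score in joint.items()}


result = naive_bayes_posterior(["free", "meeting"], priors, cond_prob)
print(result)
```

Since P(X) is the same for every class, it can be skipped entirely when only the most probable class is needed; the normalization matters only if actual probabilities are wanted.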

3. Code Implementation

import math


# Count how often each word appears in each class, and the total words per class
def calculate_word_frequency(documents, labels):
    word_freq = {}
    class_word_count = {}
    for document, label in zip(documents, labels):
        if label not in class_word_count:
            class_word_count[label] = 0
        class_word_count[label] += len(document)
        for word in document:
            if word not in word_freq:
                word_freq[word] = {}
            if label not in word_freq[word]:
                word_freq[word][label] = 0
            word_freq[word][label] += 1
    return word_freq, class_word_count


# Compute the prior probability of each class
def calculate_class_probabilities(labels):
    class_prob = {}
    total = len(labels)
    for label in labels:
        if label not in class_prob:
            class_prob[label] = 0
        class_prob[label] += 1
    for label in class_prob:
        class_prob[label] = class_prob[label] / total
    return class_prob


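# Predict the class of a new document
def predict_class(document, word_freq, class_word_count, class_prob):
    vocab_size = len(word_freq)  # number of distinct words seen in training
    best_label = None
    best_score = -math.inf
    for label in class_prob:
        # Sum logs instead of multiplying probabilities to avoid underflow
        score = math.log(class_prob[label])
        for word in document:
            # Laplace (add-one) smoothing: a word unseen in this class gets a
            # small non-zero probability instead of being silently skipped
            count = word_freq.get(word, {}).get(label, 0)
            prob = (count + 1) / (class_word_count[label] + vocab_size)
            score += math.log(prob)
        if score > best_score:
            best_score = score
            best_label = label
    return best_label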


# Example usage
documents = [
    ['apple', 'banana', 'orange'],
    ['banana', 'orange', 'cat'],
    ['apple', 'orange', 'cat'],
    ['apple', 'banana']
]
labels = ['fruit', 'fruit', 'fruit', 'fruit']

word_freq, class_word_count = calculate_word_frequency(documents, labels)
class_prob = calculate_class_probabilities(labels)

new_document = ['apple', 'cat']
predicted_class = predict_class(new_document, word_freq, class_word_count, class_prob)
print("Predicted class:", predicted_class)

Output:

Predicted class: fruit
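Because every training document in the example carries the same label, the prediction cannot be anything other than fruit. The sketch below is a compact, self-contained variant run on a made-up two-class corpus, so that the classifier has an actual choice to make. It assumes add-one (Laplace) smoothing and log-space scoring; the corpus, the labels, and the names `train` and `predict` are illustrative assumptions, not part of the original code.

```python
import math
from collections import Counter, defaultdict


def train(documents, labels):
    """Collect per-class word counts, class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)  # label -> Counter of words
    class_counts = Counter(labels)      # label -> number of documents
    for document, label in zip(documents, labels):
        word_counts[label].update(document)
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(document, word_counts, class_counts, vocab):
    """Return the label with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in document:
            # Add-one smoothing keeps unseen words from zeroing out the class
            prob = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(prob)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up two-class corpus for illustration
documents = [
    ["apple", "banana", "orange"],
    ["banana", "orange"],
    ["cat", "dog"],
    ["dog", "cat", "mouse"],
]
labels = ["fruit", "fruit", "animal", "animal"]

model = train(documents, labels)
print(predict(["apple", "orange"], *model))
print(predict(["cat", "mouse"], *model))
```

On this toy corpus the first query should come out as fruit and the second as animal, since each query's words occur only in the corresponding class.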