Classic Machine Learning Models: Naive Bayes Classification
The Mathematical Foundation of Naive Bayes Classification
I've recently been watching 天才基本法 (The Heart of Genius), a new drama starring 张子枫, and one episode featured a very interesting elementary-school math-olympiad problem: there are three identical boxes; one contains two red balls, one contains two blue balls, and one contains one red ball and one blue ball. You pick a box at random, draw one ball from it, and find that it is red. What is the probability that the remaining ball in that box is also red? The answer first: it is $\frac{2}{3}$.

The solution uses Bayes' formula: $P(A\mid B) = \frac{P(A\cap B)}{P(B)}$
In essence, Bayes' formula gives the probability of an event within a restricted sample space. For this problem, that means: within the sample space where the first ball drawn is red, what is the probability that the second ball is also red? Let $B$ be "the first ball is red" and $A$ be "the second ball is red". Then:
- $P(A\cap B) = \frac{1}{3}$ (both balls are red only when we pick the two-red box)
- $P(B) = \frac{1}{3}\cdot 0 + \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot\frac{1}{2} = \frac{1}{2}$

Plugging these into the formula, the answer is $\frac{1/3}{1/2} = \frac{2}{3}$.
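If the formula still feels unintuitive, a quick Monte Carlo simulation can confirm the result. The sketch below (plain standard-library Python; the setup mirrors the puzzle, not any code from the drama) repeatedly picks a box, draws both balls in a random order, and measures how often the second ball is red among trials where the first was red:

```python
import random

# Three identical boxes: two red, two blue, one of each
BOXES = [("red", "red"), ("blue", "blue"), ("red", "blue")]

def draw():
    """Pick a random box, then draw its two balls in random order."""
    box = random.choice(BOXES)
    first, second = random.sample(box, 2)
    return first, second

random.seed(0)
trials = 200_000
first_red = 0   # trials where the first ball drawn was red
both_red = 0    # of those, trials where the second ball was also red

for _ in range(trials):
    a, b = draw()
    if a == "red":
        first_red += 1
        if b == "red":
            both_red += 1

estimate = both_red / first_red
print(estimate)  # should be close to 2/3
```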
Olympiad math is hard; without a genuine talent for it, it's probably not worth the time. As for Bayes' formula, I feel I only began to really understand it after I started working.
The mathematical foundation of the naive Bayes classifier is exactly this Bayes formula. In Bayesian classification, we want to determine the probability that a sample with certain features belongs to a given label, written $P(Label\mid Features)$. Bayes' theorem tells us we can compute it directly with: $P(Label\mid Features) = \frac{P(Features\mid Label)\,P(Label)}{P(Features)}$

To decide which label a sample belongs to, we compute, given the sample's features, the probability of each label, and choose the label with the highest probability. $P(Label)$ is easy to compute: it can be estimated from label frequencies in the training data. By the chain rule, $P(Features\mid Label) = P(Feature_1\mid Label)\cdot P(Feature_2\mid Label, Feature_1)\cdots$. Naive Bayes is called a naive model because of its core assumption that features are mutually independent, which simplifies the expression to $P(Features\mid Label) = P(Feature_1\mid Label)\cdot P(Feature_2\mid Label)\cdots$.
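The label-picking procedure above can be sketched in a few lines of plain Python. The priors and per-feature likelihoods below are made-up numbers for a hypothetical spam/ham example, purely for illustration:

```python
# Hypothetical, made-up numbers for a two-feature spam filter
prior = {"spam": 0.4, "ham": 0.6}                       # P(Label)
likelihood = {                                          # P(Feature_i | Label)
    "spam": {"has_link": 0.7, "has_greeting": 0.1},
    "ham":  {"has_link": 0.2, "has_greeting": 0.8},
}
features = ["has_link", "has_greeting"]  # features observed in the sample

def score(label):
    # Unnormalized posterior: P(Label) * prod_i P(Feature_i | Label),
    # using the naive independence assumption
    p = prior[label]
    for f in features:
        p *= likelihood[label][f]
    return p

scores = {lab: score(lab) for lab in prior}
total = sum(scores.values())  # plays the role of P(Features)
probs = {lab: s / total for lab, s in scores.items()}

best = max(probs, key=probs.get)
print(probs, best)
```

Dividing by the sum over labels is how the shared denominator $P(Features)$ drops out in practice: for picking the argmax it could be skipped entirely.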
Because of this independence assumption, a naive Bayes classifier generally performs worse than more complex models. But it has its advantages, for example:
- It natively supports probabilistic classification, directly outputting the probability of each label
- The model is simple, with very few parameters to tune

So in practice, naive Bayes classification is often used as a baseline model in business analysis. If the classification quality meets the requirements, happy ending; if not, it's time to look at other models and retrain.
Demo
```python
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()  # full dataset; only used to inspect available categories
categories = ['talk.politics.guns', 'soc.religion.christian', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
```
Let's pick one sample and look at the data:
```python
print(train.data[1])
```

```
From: shepard@netcom.com (Mark Shepard)
Subject: S414 (Brady bill) loopholes?
Keywords: brady handguns s414 hr1025 hr277 instant check waiting period
Organization: NETCOM On-line Communication Services (408 241-9760 guest)
Distribution: na
Lines: 40

Hi. I've just finished reading S414, and have several questions about
the Brady bills (S414 and HR1025).

1. _Are_ these the current versions of the Brady bill?
What is the status of these bills? I've heard they're "in committee".
How close is that to being made law?

2. S414 and HR1025 seem fairly similar. Are there any important
differences I missed?

3. S414 seems to have some serious loopholes:
A. S414 doesn't specify an "appeals" process to wrongful denial during
the waiting period, other than a civil lawsuit(?) (S414 has an appeals
process once the required instant background check system is established,
but not before).
B. the police are explicitly NOT liable for mistakes in denying/approving
using existing records (so who would I sue in "A" above to have an
inaccurate record corrected?)
C. S414 includes an exception-to-waiting-period clause for if a person
can convince the local Chief Law-Enforcement Officer (CLEO) of an
immediate threat to his or her life, or life of a household member.
But S414 doesn't say exactly what is considered a "threat", nor does
it place a limit on how long the CLEO takes to issue an exception
statement.
True? Have I misunderstood? Any other 'holes?

4. With just S414, what's to stop a person with a "clean" record from
buying guns, grinding off the serial numbers, and selling them to crooks?
At minimum, what additional laws are needed to prevent this?
'Seems at min. a "gun counting" scheme would be needed
(e.g., "John Doe owns N guns"). So, if S414 passes, I wouldn't be surprised
to see legislation for stricter, harder-to-forge I.D.'s plus national gun
registration, justified by a need to make the Brady bill work.

Please comment. I'm mainly interested in specific problems with the current
legislation--I don't mean to start a general discussion of the merits
of any/all waiting-period bills ever proposed.

MarkS || shepard@netcom.com
```
Raw text cannot be fed into the model directly; each document needs to be converted into a numeric vector first.
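Before running the full pipeline, here is a minimal sketch of what that vectorization step produces, using a tiny made-up corpus (not the newsgroups data): `TfidfVectorizer` builds a vocabulary from the corpus and maps each document to a row of TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, just to show the shape of the output
corpus = ["the cat sat", "the dog barked", "the cat barked"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # sparse matrix: one row per document

print(sorted(vec.vocabulary_))  # learned vocabulary: one column per term
print(X.shape)                  # (3 documents, 5 vocabulary terms)
```

Common words like "the" appear in every document, so TF-IDF down-weights them relative to more distinctive terms.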
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)

f1score = f1_score(test.target, labels, average='macro')
f1score
```

```
0.9578751578149619
```
The model is very simple, but the results are quite good. Let's look at the distribution of the confusion matrix:
```python
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('True Label')
plt.ylabel('Predicted Label')
```
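One of the advantages noted earlier is that naive Bayes natively outputs probabilities. As a self-contained sketch (using a tiny made-up corpus and labels rather than the newsgroups data, so it runs on its own), the same `TfidfVectorizer` + `MultinomialNB` pipeline exposes this through `predict_proba`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus and labels, for illustration only
docs = [
    "rocket launch into orbit",
    "church service on sunday",
    "satellite in orbit",
    "prayer at the church",
]
doc_labels = ["space", "religion", "space", "religion"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, doc_labels)

# Per-label probabilities for a new document, not just the argmax label
probs = clf.predict_proba(["orbit of the satellite"])
print(dict(zip(clf.classes_, probs[0])))
```

When the downstream business logic needs a confidence threshold (e.g. only auto-classify above 90% probability), this per-label output is what makes naive Bayes convenient as a baseline.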