Classic Machine Learning Models: Naive Bayes Classification
The Mathematical Foundation of Naive Bayes Classification
I've recently been watching 天才基本法 (The Heart of Genius), a new drama starring 张子枫, and one episode featured a very interesting elementary-school math-olympiad problem: there are three identical boxes; one contains two red balls, one contains two blue balls, and one contains one red ball and one blue ball. You pick a box at random, draw one ball from it, and find that it is red. What is the probability that the remaining ball in that box is also red? The answer first: it is $\frac{2}{3}$.

The solution uses Bayes' formula: $P(A\mid B) = \frac{P(A\cap B)}{P(B)}$
In essence, Bayes' formula gives the probability of an event within a restricted sample space. For this problem, that means: within the sample space where the first ball drawn is red, what is the probability that the second ball is also red? Let $B$ be "the first ball is red" and $A$ be "the second ball is red". Then:
- $P(A\cap B) = \frac{1}{3}$ (both balls are red only when we pick the two-red box)
- $P(B) = \frac{1}{3}\cdot 0 + \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot\frac{1}{2} = \frac{1}{2}$

Plugging these into the formula, the answer is $\frac{1/3}{1/2} = \frac{2}{3}$.
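If the formula still feels unintuitive, a quick Monte Carlo simulation can confirm the result. The sketch below (plain standard-library Python; the setup mirrors the puzzle, not any code from the drama) repeatedly picks a box, draws both balls in a random order, and measures how often the second ball is red among trials where the first was red:

```python
import random

# Three identical boxes: two red, two blue, one of each
BOXES = [("red", "red"), ("blue", "blue"), ("red", "blue")]

def draw():
    """Pick a random box, then draw its two balls in random order."""
    box = random.choice(BOXES)
    first, second = random.sample(box, 2)
    return first, second

random.seed(0)
trials = 200_000
first_red = 0   # trials where the first ball drawn was red
both_red = 0    # of those, trials where the second ball was also red

for _ in range(trials):
    a, b = draw()
    if a == "red":
        first_red += 1
        if b == "red":
            both_red += 1

estimate = both_red / first_red
print(estimate)  # should be close to 2/3
```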
Olympiad math is hard; without a genuine talent for it, it's probably not worth the time. As for Bayes' formula, I feel I only began to really understand it after I started working.
The mathematical foundation of the naive Bayes classifier is exactly this Bayes formula. In Bayesian classification, we want to determine the probability that a sample with certain features belongs to a given label, written $P(Label\mid Features)$. Bayes' theorem tells us we can compute it directly with: $P(Label\mid Features) = \frac{P(Features\mid Label)\,P(Label)}{P(Features)}$

To decide which label a sample belongs to, we compute, given the sample's features, the probability of each label, and choose the label with the highest probability. $P(Label)$ is easy to compute: it can be estimated from label frequencies in the training data. By the chain rule, $P(Features\mid Label) = P(Feature_1\mid Label)\cdot P(Feature_2\mid Label, Feature_1)\cdots$. Naive Bayes is called a naive model because of its core assumption that features are mutually independent, which simplifies the expression to $P(Features\mid Label) = P(Feature_1\mid Label)\cdot P(Feature_2\mid Label)\cdots$.
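The label-picking procedure above can be sketched in a few lines of plain Python. The priors and per-feature likelihoods below are made-up numbers for a hypothetical spam/ham example, purely for illustration:

```python
# Hypothetical, made-up numbers for a two-feature spam filter
prior = {"spam": 0.4, "ham": 0.6}                       # P(Label)
likelihood = {                                          # P(Feature_i | Label)
    "spam": {"has_link": 0.7, "has_greeting": 0.1},
    "ham":  {"has_link": 0.2, "has_greeting": 0.8},
}
features = ["has_link", "has_greeting"]  # features observed in the sample

def score(label):
    # Unnormalized posterior: P(Label) * prod_i P(Feature_i | Label),
    # using the naive independence assumption
    p = prior[label]
    for f in features:
        p *= likelihood[label][f]
    return p

scores = {lab: score(lab) for lab in prior}
total = sum(scores.values())  # plays the role of P(Features)
probs = {lab: s / total for lab, s in scores.items()}

best = max(probs, key=probs.get)
print(probs, best)
```

Dividing by the sum over labels is how the shared denominator $P(Features)$ drops out in practice: for picking the argmax it could be skipped entirely.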
Because of this independence assumption, a naive Bayes classifier generally performs worse than more complex models. But it has its advantages, for example:
- It natively supports probabilistic classification, directly outputting the probability of each label
- The model is simple, with very few parameters to tune

So in practice, naive Bayes classification is often used as a baseline model in business analysis. If the classification quality meets the requirements, happy ending; if not, it's time to look at other models and retrain.
Demo
```python
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()  # full dataset; only used to inspect available categories
categories = ['talk.politics.guns', 'soc.religion.christian', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
```
Let's pick one sample and look at the data:
```python
print(train.data[1])
```

```
From: shepard@netcom.com (Mark Shepard)
Subject: S414 (Brady bill) loopholes?
Keywords: brady handguns s414 hr1025 hr277 instant check waiting period
Organization: NETCOM On-line Communication Services (408 241-9760 guest)
Distribution: na
Lines: 40

Hi. I've just finished reading S414, and have several questions about
the Brady bills (S414 and HR1025).

1. _Are_ these the current versions of the Brady bill?
What is the status of these bills? I've heard they're "in committee".
How close is that to being made law?

2. S414 and HR1025 seem fairly similar. Are there any important
differences I missed?

3. S414 seems to have some serious loopholes:
A. S414 doesn't specify an "appeals" process to wrongful denial during
the waiting period, other than a civil lawsuit(?) (S414 has an appeals
process once the required instant background check system is established,
but not before).
B. the police are explicitly NOT liable for mistakes in denying/approving
using existing records (so who would I sue in "A" above to have an
inaccurate record corrected?)
C. S414 includes an exception-to-waiting-period clause for if a person
can convince the local Chief Law-Enforcement Officer (CLEO) of an
immediate threat to his or her life, or life of a household member.
But S414 doesn't say exactly what is considered a "threat", nor does
it place a limit on how long the CLEO takes to issue an exception
statement.
True? Have I misunderstood? Any other 'holes?

4. With just S414, what's to stop a person with a "clean" record from
buying guns, grinding off the serial numbers, and selling them to crooks?
At minimum, what additional laws are needed to prevent this?
'Seems at min. a "gun counting" scheme would be needed
(e.g., "John Doe owns N guns"). So, if S414 passes, I wouldn't be surprised
to see legislation for stricter, harder-to-forge I.D.'s plus national gun
registration, justified by a need to make the Brady bill work.

Please comment. I'm mainly interested in specific problems with the current
legislation--I don't mean to start a general discussion of the merits
of any/all waiting-period bills ever proposed.

MarkS || shepard@netcom.com
```
Raw text cannot be fed into the model directly; each document needs to be converted into a numeric vector first.
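Before running the full pipeline, here is a minimal sketch of what that vectorization step produces, using a tiny made-up corpus (not the newsgroups data): `TfidfVectorizer` builds a vocabulary from the corpus and maps each document to a row of TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, just to show the shape of the output
corpus = ["the cat sat", "the dog barked", "the cat barked"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # sparse matrix: one row per document

print(sorted(vec.vocabulary_))  # learned vocabulary: one column per term
print(X.shape)                  # (3 documents, 5 vocabulary terms)
```

Common words like "the" appear in every document, so TF-IDF down-weights them relative to more distinctive terms.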
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)

f1score = f1_score(test.target, labels, average='macro')
f1score
```

```
0.9578751578149619
```
The model is very simple, but the results are quite good. Let's look at the distribution of the confusion matrix:
```python
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('True Label')
plt.ylabel('Predicted Label')
```
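One of the advantages noted earlier is that naive Bayes natively outputs probabilities. As a self-contained sketch (using a tiny made-up corpus and labels rather than the newsgroups data, so it runs on its own), the same `TfidfVectorizer` + `MultinomialNB` pipeline exposes this through `predict_proba`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus and labels, for illustration only
docs = [
    "rocket launch into orbit",
    "church service on sunday",
    "satellite in orbit",
    "prayer at the church",
]
doc_labels = ["space", "religion", "space", "religion"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, doc_labels)

# Per-label probabilities for a new document, not just the argmax label
probs = clf.predict_proba(["orbit of the satellite"])
print(dict(zip(clf.classes_, probs[0])))
```

When the downstream business logic needs a confidence threshold (e.g. only auto-classify above 90% probability), this per-label output is what makes naive Bayes convenient as a baseline.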