This post does not go into the theory of Bayesian methods; the focus is a Python implementation of the Gaussian naive Bayes algorithm.
Gaussian naive Bayes (GaussianNB) assumes that, within each class, every feature follows a simple normal distribution:
$$P(X_j = x_j \mid Y = C_k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^2}} \exp\left(-\frac{(x_j - \mu_{kj})^2}{2\sigma_{kj}^2}\right)$$

where $C_k$ is the $k$-th class of $Y$, and $\mu_{kj}$, $\sigma_{kj}^2$ are the mean and variance of feature $j$ within that class.
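As a quick sanity check on the formula, the density can be evaluated directly with NumPy (the sample value, mean, and variance below are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# illustrative values: feature value 1.0 under a class with mean 0 and variance 1
p = gaussian_pdf(1.0, 0.0, 1.0)
```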
1. The code:

```python
import math
import numpy as np

class GaussianNB:
    def fit(self, X_train, Y_train):
        # per-class mean and variance of each feature
        # (the original referenced the global y here; it must be Y_train)
        self.classes_ = np.unique(Y_train)
        self.mu = np.array([X_train[Y_train == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X_train[Y_train == c].var(axis=0) for c in self.classes_])
        return self

    def predict(self, X_test):
        # per sample: Gaussian density of each feature under each class,
        # product over features = likelihood, argmax = predicted class
        self.result = [self.classes_[(np.exp(-(X_test[j] - self.mu) ** 2 / (2 * self.var_))
                       / (2 * math.pi * self.var_) ** 0.5).prod(axis=1).argmax()]
                       for j in range(X_test.shape[0])]
        return self.result

    def score(self, X_test, Y_test):
        accuracy = (np.array(self.result) == Y_test).sum() / Y_test.shape[0]
        return accuracy
```
Note that I used heavy nesting here purely to keep the implementation tightly encapsulated, and I leaned on NumPy computations wherever possible to speed things up.
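The key NumPy trick in `predict` is broadcasting: subtracting a `(d,)` sample from the `(k, d)` matrix of class means gives a `(k, d)` array of deviations, and the product over the feature axis collapses it to one likelihood per class. A minimal shape sketch (the class count and feature count here are made up):

```python
import numpy as np

# hypothetical shapes: 3 classes, 4 features
mu = np.zeros((3, 4))    # per-class feature means, shape (k, d)
var = np.ones((3, 4))    # per-class feature variances, shape (k, d)
x = np.ones(4)           # a single test sample, shape (d,)

diff = x - mu            # (d,) broadcasts against (k, d) -> (k, d)
density = np.exp(-diff ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
likelihood = density.prod(axis=1)   # product over features -> one value per class
pred = likelihood.argmax()          # index of the most likely class
```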
2. Comparison with sklearn's GaussianNB

```python
%%time
from sklearn.model_selection import train_test_split

score1 = []
for i in range(10000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    GN = GaussianNB()
    GN = GN.fit(X_train, Y_train)
    result = GN.predict(X_test)
    score1.append(GN.score(X_test, Y_test))
print("score1", np.mean(score1))
```
```python
%%time
from sklearn.naive_bayes import GaussianNB

score2 = []
for i in range(10000):
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    GB = GaussianNB()
    GB = GB.fit(X_train, Y_train)
    result = GB.predict(X_test)
    score2.append(GB.score(X_test, Y_test))
print("score2", np.mean(score2))
```
3. Summary:
In terms of accuracy, my implementation is on par with sklearn's, but the gap in runtime is large: sklearn's code is far better optimized. The likely culprit is the Python `for` loop in my `predict`, which drags down the overall speed.
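One way to close that gap is to eliminate the per-sample Python loop and let NumPy broadcast over samples and classes at once. A sketch under that idea (`predict_vectorized` is a name I made up; it returns class indices, and the toy means and samples below are for illustration only):

```python
import numpy as np

def predict_vectorized(X_test, mu, var):
    """Loop-free predict: X_test is (n, d), mu and var are (k, d).

    X_test[:, None, :] - mu broadcasts to (n, k, d); the product over the
    feature axis yields an (n, k) likelihood matrix, and argmax over the
    class axis picks one class index per sample.
    """
    diff = X_test[:, None, :] - mu
    dens = np.exp(-diff ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return dens.prod(axis=2).argmax(axis=1)

# toy check with two well-separated classes
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.ones((2, 2))
X_test = np.array([[0.1, -0.2], [4.8, 5.3]])
pred = predict_vectorized(X_test, mu, var)
```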