1. Conditional probability
Before computing the probabilities p1 and p2, we need to understand conditional probability: the probability that event A occurs given that event B has occurred, written P(A|B).
Picture A and B as two overlapping regions of the sample space (a Venn diagram). Once B is known to have occurred, the only part of A that can still occur is the overlap A∩B, so the probability of A given B is P(A∩B) divided by P(B):

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
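As a quick sanity check of this identity, we can estimate both sides by simulation. The following is a minimal sketch (not from the original article); the die-roll events A and B are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)  # rolls of a fair six-sided die

A = (x % 2 == 0)  # event A: the roll is even
B = (x > 3)       # event B: the roll is greater than 3

# P(A|B) estimated directly vs. via P(A∩B)/P(B)
print(A[B].mean())                # direct estimate of P(A|B)
print((A & B).mean() / B.mean())  # P(A∩B)/P(B); both approach 2/3
```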
2. The total probability formula
Suppose the sample space S is the union of two events, A and A'.
In the Venn diagram, one region is event A and the other is its complement A'; together they make up the whole sample space S.
In that case, any event B can be split into two parts, the part inside A and the part inside A':

$$P(B) = P(B \cap A) + P(B \cap A')$$

Since P(B∩A) = P(B|A)P(A), this becomes

$$P(B) = P(B|A)P(A) + P(B|A')P(A')$$

This is the total probability formula. It says that if A and A' form a partition of the sample space, then the probability of an event B equals the sum of the probabilities of A and A', each multiplied by the conditional probability of B given that event.
Substituting this formula into the conditional probability formula of the previous section gives another way of writing conditional probability:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A')P(A')}$$
3. Bayesian inference
Rearranging the conditional probability formula gives the following form:

$$P(A|B) = P(A)\,\frac{P(B|A)}{P(B)}$$

We call P(A) the "prior probability": our judgment of the probability of event A before event B is observed.
P(A|B) is called the "posterior probability": our re-evaluation of the probability of A after B has been observed.
P(B|A)/P(B) is called the "likelihood" function; it is an adjustment factor that pulls the estimated probability closer to the true probability.
This is the meaning of Bayesian inference: we first estimate a "prior probability" (the probability of each class), then fold in the experimental result and see whether the experiment strengthens or weakens that prior, obtaining a "posterior probability" that is closer to the truth.
Here, if the likelihood P(B|A)/P(B) > 1, the prior is strengthened and event A becomes more likely; if the likelihood equals 1, event B is of no help in judging A; if the likelihood is less than 1, the prior is weakened and A becomes less likely.
To deepen our understanding of Bayesian inference, consider an example.
There are two identical bowls. Bowl 1 contains 30 fruit candies and 10 chocolate candies; bowl 2 contains 20 of each. We pick a bowl at random and draw one candy, which turns out to be a fruit candy. What is the probability that this fruit candy came from bowl 1?
Let H1 denote bowl 1 and H2 denote bowl 2. Since the two bowls are identical, P(H1) = P(H2); before the candy is drawn, the two bowls are equally likely to be chosen, so P(H1) = 0.5. We call this the "prior probability": before running the experiment, the probability that the candy comes from bowl 1 is 0.5.
Now let E denote drawing a fruit candy. The question becomes: given E, what is the probability that the candy came from bowl 1, i.e. find P(H1|E)? We call this the "posterior probability": the revision of P(H1) after event E has occurred.
We can read this as a classification problem:
- 1. The two classes are H1 and H2.
- 2. The feature data are [[30,10],[20,20]], with corresponding classes [H1,H2].
- 3. Given a new feature row [[1,0]] (one fruit candy, no chocolate), find its class, i.e. the probability that it belongs to H1, as worked through below.
Bayesian solution (E = fruit candy, Q = chocolate candy). The quantity to solve for is $P(H_1|E)$.
- 1. Compute the probability of each class (the prior): $P(H_1)=1/2=0.5$, $P(H_2)=1/2=0.5$.
- 2. Compute the posteriors directly by counting candies (possible here because both bowls hold the same number of candies and the priors are equal):
  - $P(H_1|E)=30/(30+20)=0.6$
  - $P(H_1|Q)=10/(10+20)=\frac{1}{3}$
  - $P(H_2|E)=20/(20+30)=0.4$
  - $P(H_2|Q)=20/(10+20)=\frac{2}{3}$
- 3. Compute the likelihoods:
  - $P(E|H_1)=30/40=\frac{3}{4}$
  - $P(Q|H_1)=10/40=\frac{1}{4}$
  - $P(E|H_2)=20/40=\frac{1}{2}$
  - $P(Q|H_2)=20/40=\frac{1}{2}$
- 4. By the total probability formula:
  - $P(E)=P(E|H_1)P(H_1)+P(E|H_2)P(H_2)=\frac{3}{4}\cdot 0.5+\frac{1}{2}\cdot 0.5=0.625$
  - $P(Q)=P(Q|H_1)P(H_1)+P(Q|H_2)P(H_2)=\frac{1}{4}\cdot 0.5+\frac{1}{2}\cdot 0.5=0.375$
- 5. Finally, $P(H_1|E)=P(H_1)\frac{P(E|H_1)}{P(E)}=0.5\cdot\frac{0.75}{0.625}=0.6$, which matches the direct count in step 2.
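The same arithmetic in a few lines of Python; a minimal sketch for checking the numbers above, not part of the original code:

```python
# Candy-bowl example: prior, likelihood, total probability, posterior.
priors = {"H1": 0.5, "H2": 0.5}
likelihood_E = {"H1": 30 / 40, "H2": 20 / 40}  # P(E|H) for each bowl

# Total probability: P(E) = sum over H of P(E|H) * P(H)
p_E = sum(likelihood_E[h] * priors[h] for h in priors)

# Bayes' formula: P(H1|E) = P(H1) * P(E|H1) / P(E)
posterior_H1 = priors["H1"] * likelihood_E["H1"] / p_E
print(p_E, posterior_H1)  # 0.625 0.6
```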
4. Naive Bayes inference
Naive Bayes adds a conditional independence assumption on the conditional probability distribution. For example, with n features the class-conditional probability factorizes as:

$$P(X|a)=P(x_1|a)\,P(x_2|a)\cdots P(x_n|a)$$
Six outpatients visit a hospital one morning, with records as in the following table (reconstructed from the counts used in the calculation below):

| Symptom  | Occupation          | Disease    |
| -------- | ------------------- | ---------- |
| sneezing | nurse               | cold       |
| sneezing | farmer              | allergy    |
| headache | construction worker | concussion |
| headache | construction worker | cold       |
| sneezing | teacher             | cold       |
| headache | teacher             | concussion |

Now a seventh patient arrives: a construction worker who is sneezing. What is the probability that he has a cold? By Bayes' formula,

$$P(\text{cold}\,|\,\text{sneezing} \times \text{construction worker})=\frac{P(\text{sneezing} \times \text{construction worker}\,|\,\text{cold})\,P(\text{cold})}{P(\text{sneezing} \times \text{construction worker})}$$

By the naive Bayes conditional independence assumption, the two features "sneezing" and "construction worker" are independent, so the equation above becomes

$$P(\text{cold}\,|\,\text{sneezing} \times \text{construction worker})=\frac{P(\text{sneezing}\,|\,\text{cold})\,P(\text{construction worker}\,|\,\text{cold})\,P(\text{cold})}{P(\text{sneezing})\,P(\text{construction worker})}=\frac{\frac{2}{3}\times\frac{1}{3}\times\frac{1}{2}}{\frac{1}{2}\times\frac{1}{3}}\approx 0.66$$

This is the basic method of a Bayes classifier: on the basis of statistical data and given certain features, compute the probability of each class and classify accordingly.
Likewise, when programming this, if the concrete class probabilities are not needed, P(sneezing) = 0.5 and P(construction worker) = 0.33 need not be computed at all, since they are the same for every class.
For reference, P(sneezing) itself comes from the total probability formula:

$$P(\text{sneezing})=P(\text{cold})\,P(\text{sneezing}\,|\,\text{cold})+P(\text{allergy})\,P(\text{sneezing}\,|\,\text{allergy})+P(\text{concussion})\,P(\text{sneezing}\,|\,\text{concussion})=\frac{3}{6}\times\frac{2}{3}+\frac{1}{6}\times\frac{1}{1}+\frac{2}{6}\times\frac{0}{2}=0.5$$
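As a quick check, here is the whole naive Bayes computation for the seventh patient in plain Python; a hedged sketch, not from the original article:

```python
# Records as (symptom, occupation, disease), matching the table above.
records = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]

def p(pred):
    """Fraction of records satisfying a predicate."""
    return sum(pred(r) for r in records) / len(records)

def p_given(pred, cond):
    """Conditional fraction among records satisfying cond."""
    sub = [r for r in records if cond(r)]
    return sum(pred(r) for r in sub) / len(sub)

is_cold = lambda r: r[2] == "cold"
prior = p(is_cold)                                                          # P(cold) = 3/6
likelihood = (p_given(lambda r: r[0] == "sneezing", is_cold)                # P(sneezing|cold) = 2/3
              * p_given(lambda r: r[1] == "construction worker", is_cold))  # P(worker|cold) = 1/3
evidence = p(lambda r: r[0] == "sneezing") * p(lambda r: r[1] == "construction worker")
print(prior * likelihood / evidence)  # ≈ 0.667
```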
5. Practice
Simple text classification
"""
Author:wucng
Time: 20200109
Summary: 朴素贝叶斯分类
源代码: https://github.com/wucng/MLAndDL
参考:https://cuijiahua.com/blog/2017/11/ml_4_bayes_1.html
"""
import numpy as np
from functools import reduce
from collections import Counter
import pickle,os
# 1.加载数据
def loadDataSet():
postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], #切分的词条
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0,1,0,1,0,1] #类别标签向量,1代表侮辱性词汇,0代表不是
return postingList,classVec
# 2.文本向量化
def createVocabList(dataSet:list)->list:
# 先建立词汇表
vocabSet = set([])
for data in dataSet:
vocabSet = vocabSet | set(data) # 并集
return sorted(list(vocabSet))
def word2vec(vocabList,dataSet):
vecs = np.zeros([len(dataSet),len(vocabList)])
for i,data in enumerate(dataSet):
for word in data:
if word in vocabList:
vecs[i,vocabList.index(word)] = 1 # 标记为1
return vecs
class NaiveBayesClassifier(object):
def __init__(self,save_file="model.ckpt"):
self.save_file = save_file
def fit(self,X:np.array,y:np.array):
if not os.path.exists(self.save_file):
# 计算分成每个类别的概率值
dict_y = dict(Counter(y))
dict_y = {k:v/len(y) for k,v in dict_y.items()}
# 计算每维特征每个特征值发生概率值
unique_label = list(set(y))
dict_feature_value={} # 每个特征每个值对应的概率
for col in range(len(X[0])):
data = X[...,col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value[str(col)+"_"+str(val)] = np.sum(data==val)/len(data)
dict_feature_value_label = {} # 每个类别发生对应的每个特征每个值的概率
for label in unique_label:
datas = X[y==label]
for col in range(len(datas[0])):
data = datas[..., col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)]=np.sum(data==val)/len(data)
# save
result={"dict_y":dict_y,"dict_feature_value":dict_feature_value,
"dict_feature_value_label":dict_feature_value_label}
pickle.dump(result,open(self.save_file,"wb"))
# return dict_y,dict_feature_value,dict_feature_value_label
def __predict(self,X:np.array):
data = pickle.load(open(self.save_file,"rb"))
dict_y, dict_feature_value, dict_feature_value_label = data["dict_y"],data["dict_feature_value"],\
data["dict_feature_value_label"]
labels = sorted(list(dict_y.keys()))
# 计算每条数据分成每个类别的概率值
preds = np.zeros([len(X),len(labels)])
for i,x in enumerate(X):
for j,label in enumerate(labels):
p1 = 1
p2 = 1
for col,val in enumerate(x):
p1*= dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)] if str(label)+"_"+str(col)+"_"+str(val) \
in dict_feature_value_label else self.__weighted_average(str(label)+"_"+str(col)+"_"+str(val),dict_feature_value_label)
p2*= dict_feature_value[str(col)+"_"+str(val)] if str(col)+"_"+str(val) in dict_feature_value else \
self.__weighted_average(str(col)+"_"+str(val),dict_feature_value)
preds[i,j] = p1*dict_y[label]/p2
return preds
def __fixed_value(self):
return 1e-3
def __weighted_average(self,key:str,data_dict:dict):
"""插值方式找到离该key对应的最近的data_dict中的key做距离加权平均"""
tmp = key.split("_")
value = float(tmp[-1])
if len(tmp)==3:
tmp_key = tmp[0]+"_"+tmp[1]+"_"
else:
tmp_key = tmp[0] + "_"
# 找到相关的key
# related_keys = []
values = [value]
for k in list(data_dict.keys()):
if tmp_key in k:
# related_keys.append(k)
values.append(float(k.split("_")[-1]))
# 做距离加权
values = sorted(values)
index = values.index(value)
# 取其前一个和后一个做插值
last = max(0,index-1)
next = min(index+1,len(values)-1)
if index==last or index==next:
return self.__fixed_value()
else:
d1=abs(values[last] - value)
d2=abs(values[next] - value)
v1 = data_dict[tmp_key+str(values[last])]
v2 = data_dict[tmp_key+str(values[next])]
# 距离加权 y=e^(-x)
return (np.log(d1)*v1+np.log(d2)*v2)/(np.log(d1)+np.log(d2))
def predict_proba(self,X:np.array):
return self.__predict(X)
def predict(self,X:np.array):
return np.argmax(self.__predict(X),-1)
def accuracy(self,y_true:np.array,y_pred:np.array)->float:
return round(np.sum(y_pred==y_true)/len(y_pred),5)
if __name__=="__main__":
dataset,label = loadDataSet()
vocabList = createVocabList(dataset)
dataset = word2vec(vocabList,dataset)
label = np.asarray(label)
# print(dataset.shape,label.shape) # (6, 32) (6,)
clf = NaiveBayesClassifier()
clf.fit(dataset,label)
print(clf.predict(dataset))
# [0 1 0 1 0 1]
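To classify previously unseen text with the model trained above, a new post just has to be vectorized against the same vocabulary. Below is a usage sketch meant to run in the same session as the script above; the two test posts are made up for illustration:

```python
# Vectorize two hypothetical new posts against the training vocabulary
test_posts = [['love', 'my', 'dalmation'],
              ['stupid', 'garbage']]
test_vecs = word2vec(vocabList, test_posts)
print(clf.predict(test_vecs))        # expected: [0 1]
print(clf.predict_proba(test_vecs))  # per-class scores before argmax
```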
Iris data classification
"""
Author:wucng
Time: 20200110
Summary: 朴素贝叶斯对iris数据分类
源代码: https://github.com/wucng/MLAndDL
参考:https://cuijiahua.com/blog/2017/11/ml_4_bayes_1.html
"""
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor,KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,auc
import pandas as pd
import numpy as np
from functools import reduce
from collections import Counter
import pickle,os,time
# 1.加载数据集(并做预处理)
def loadData(dataPath: str) -> tuple:
# 如果有标题可以省略header,names ;sep 为数据分割符
df = pd.read_csv(dataPath, sep=",", header=-1,
names=["sepal_length", "sepal_width", "petal_length", "petal_width", "label"])
# 填充缺失值
df = df.fillna(0)
# 数据量化
# 文本量化
df.replace("Iris-setosa", 0, inplace=True)
df.replace("Iris-versicolor", 1, inplace=True)
df.replace("Iris-virginica", 2, inplace=True)
# 划分出特征数据与标签数据
X = df.drop("label", axis=1) # 特征数据
y = df.label # or df["label"] # 标签数据
# 数据归一化
X = (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))
# 使用sklearn方式
# X = MinMaxScaler().transform(X)
# 查看df信息
# df.info()
# df.describe()
return (X.to_numpy(), y.to_numpy())
class NaiveBayesClassifier(object):
def __init__(self,save_file="model.ckpt"):
self.save_file = save_file
def fit(self,X:np.array,y:np.array):
if not os.path.exists(self.save_file):
# 计算分成每个类别的概率值
dict_y = dict(Counter(y))
dict_y = {k:v/len(y) for k,v in dict_y.items()}
# 计算每维特征每个特征值发生概率值
unique_label = list(set(y))
dict_feature_value={} # 每个特征每个值对应的概率
for col in range(len(X[0])):
data = X[...,col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value[str(col)+"_"+str(val)] = np.sum(data==val)/len(data)
dict_feature_value_label = {} # 每个类别发生对应的每个特征每个值的概率
for label in unique_label:
datas = X[y==label]
for col in range(len(datas[0])):
data = datas[..., col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)]=np.sum(data==val)/len(data)
# save
result={"dict_y":dict_y,"dict_feature_value":dict_feature_value,
"dict_feature_value_label":dict_feature_value_label}
pickle.dump(result,open(self.save_file,"wb"))
# return dict_y,dict_feature_value,dict_feature_value_label
def __predict(self,X:np.array):
data = pickle.load(open(self.save_file,"rb"))
dict_y, dict_feature_value, dict_feature_value_label = data["dict_y"],data["dict_feature_value"],\
data["dict_feature_value_label"]
labels = sorted(list(dict_y.keys()))
# 计算每条数据分成每个类别的概率值
preds = np.zeros([len(X),len(labels)])
for i,x in enumerate(X):
for j,label in enumerate(labels):
p1 = 1
p2 = 1
for col,val in enumerate(x):
p1*= dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)] if str(label)+"_"+str(col)+"_"+str(val) \
in dict_feature_value_label else self.__weighted_average(str(label)+"_"+str(col)+"_"+str(val),dict_feature_value_label) # self.__fixed_value()
p2*= dict_feature_value[str(col)+"_"+str(val)] if str(col)+"_"+str(val) in dict_feature_value else \
self.__weighted_average(str(col)+"_"+str(val),dict_feature_value) # self.__fixed_value()
preds[i,j] = p1*dict_y[label]/p2
return preds
def __fixed_value(self):
return 1e-3
def __weighted_average(self,key:str,data_dict:dict):
"""插值方式找到离该key对应的最近的data_dict中的key做距离加权平均"""
tmp = key.split("_")
value = float(tmp[-1])
if len(tmp)==3:
tmp_key = tmp[0]+"_"+tmp[1]+"_"
else:
tmp_key = tmp[0] + "_"
# 找到相关的key
# related_keys = []
values = [value]
for k in list(data_dict.keys()):
if tmp_key in k:
# related_keys.append(k)
values.append(float(k.split("_")[-1]))
# 做距离加权
values = sorted(values)
index = values.index(value)
# 取其前一个和后一个做插值
last = max(0,index-1)
next = min(index+1,len(values)-1)
if index==last or index==next:
return self.__fixed_value()
else:
d1=abs(values[last] - value)
d2=abs(values[next] - value)
v1 = data_dict[tmp_key+str(values[last])]
v2 = data_dict[tmp_key+str(values[next])]
# 距离加权 y=e^(-x)
return (np.log(d1)*v1+np.log(d2)*v2)/(np.log(d1)+np.log(d2))
def predict_proba(self,X:np.array):
return self.__predict(X)
def predict(self,X:np.array):
return np.argmax(self.__predict(X),-1)
def accuracy(self,y_true:np.array,y_pred:np.array)->float:
return round(np.sum(y_pred==y_true)/len(y_pred),5)
if __name__=="__main__":
dataPath = "../../dataset/iris.data"
X, y = loadData(dataPath)
# 划分训练集与测试集
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=40)
start = time.time()
clf = NaiveBayesClassifier()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f"%(time.time()-start,clf.accuracy(y_test,y_pred)))
# cost time:0.012998(s) acc:1.000
# 使用sklearn 的GaussianNB
start = time.time()
clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f" % (time.time() - start, accuracy_score(y_test, y_pred)))
# cost time:0.000996(s) acc:1.000
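One pitfall worth noting (our observation, not from the original post): NaiveBayesClassifier caches its parameters in save_file, and fit() silently skips training whenever that file already exists. Running the text, iris, and titanic scripts in sequence with the default save_file="model.ckpt" would therefore reuse stale parameters from an earlier dataset. A minimal guard, using a hypothetical per-dataset file name:

```python
import os

save_file = "iris_model.ckpt"  # hypothetical name; any path distinct per dataset works
if os.path.exists(save_file):
    os.remove(save_file)  # drop the stale cache so fit() retrains from scratch

clf = NaiveBayesClassifier(save_file=save_file)
clf.fit(X_train, y_train)
```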
Titanic classification
"""
Author:wucng
Time: 20200110
Summary: 朴素贝叶斯对titanic数据分类
源代码: https://github.com/wucng/MLAndDL
参考:https://cuijiahua.com/blog/2017/11/ml_4_bayes_1.html
"""
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,MinMaxScaler
# from sklearn.neighbors import KNeighborsRegressor,KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,auc
import pandas as pd
import numpy as np
from functools import reduce
from collections import Counter
import pickle,os,time
# 1.加载数据集(并做预处理)
def loadData(dataPath: str) -> tuple:
# 如果有标题可以省略header,names ;sep 为数据分割符
df = pd.read_csv(dataPath, sep=",")
# 填充缺失值
df["Age"] = df["Age"].fillna(df["Age"].median())
df['Embarked'] = df['Embarked'].fillna('S')
# df = df.fillna(0)
# 数据量化
# 文本量化
df.replace("male", 0, inplace=True)
df.replace("female", 1, inplace=True)
df.loc[df["Embarked"] == "S", "Embarked"] = 0
df.loc[df["Embarked"] == "C", "Embarked"] = 1
df.loc[df["Embarked"] == "Q", "Embarked"] = 2
# 划分出特征数据与标签数据
X = df.drop(["PassengerId","Survived","Name","Ticket","Cabin"], axis=1) # 特征数据
y = df.Survived # or df["Survived"] # 标签数据
# 数据归一化
X = (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))
# 使用sklearn方式
# X = MinMaxScaler().transform(X)
# 查看df信息
# df.info()
# df.describe()
return (X.to_numpy(), y.to_numpy())
class NaiveBayesClassifier(object):
def __init__(self,save_file="model.ckpt"):
self.save_file = save_file
def fit(self,X:np.array,y:np.array):
if not os.path.exists(self.save_file):
# 计算分成每个类别的概率值
dict_y = dict(Counter(y))
dict_y = {k:v/len(y) for k,v in dict_y.items()}
# 计算每维特征每个特征值发生概率值
unique_label = list(set(y))
dict_feature_value={} # 每个特征每个值对应的概率
for col in range(len(X[0])):
data = X[...,col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value[str(col)+"_"+str(val)] = np.sum(data==val)/len(data)
dict_feature_value_label = {} # 每个类别发生对应的每个特征每个值的概率
for label in unique_label:
datas = X[y==label]
for col in range(len(datas[0])):
data = datas[..., col] # 每列特征
unique_val = list(set(data))
for val in unique_val:
dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)]=np.sum(data==val)/len(data)
# save
result={"dict_y":dict_y,"dict_feature_value":dict_feature_value,
"dict_feature_value_label":dict_feature_value_label}
pickle.dump(result,open(self.save_file,"wb"))
# return dict_y,dict_feature_value,dict_feature_value_label
def __predict(self,X:np.array):
data = pickle.load(open(self.save_file,"rb"))
dict_y, dict_feature_value, dict_feature_value_label = data["dict_y"],data["dict_feature_value"],\
data["dict_feature_value_label"]
labels = sorted(list(dict_y.keys()))
# 计算每条数据分成每个类别的概率值
preds = np.zeros([len(X),len(labels)])
for i,x in enumerate(X):
for j,label in enumerate(labels):
p1 = 1
p2 = 1
for col,val in enumerate(x):
p1*= dict_feature_value_label[str(label)+"_"+str(col)+"_"+str(val)] if str(label)+"_"+str(col)+"_"+str(val) \
in dict_feature_value_label else self.__weighted_average(str(label)+"_"+str(col)+"_"+str(val),dict_feature_value_label) # self.__fixed_value()
p2*= dict_feature_value[str(col)+"_"+str(val)] if str(col)+"_"+str(val) in dict_feature_value else \
self.__weighted_average(str(col)+"_"+str(val),dict_feature_value) # self.__fixed_value()
preds[i,j] = p1*dict_y[label]/p2
return preds
def __fixed_value(self):
return 1e-3
def __weighted_average(self,key:str,data_dict:dict):
"""插值方式找到离该key对应的最近的data_dict中的key做距离加权平均"""
tmp = key.split("_")
value = float(tmp[-1])
if len(tmp)==3:
tmp_key = tmp[0]+"_"+tmp[1]+"_"
else:
tmp_key = tmp[0] + "_"
# 找到相关的key
# related_keys = []
values = [value]
for k in list(data_dict.keys()):
if tmp_key in k:
# related_keys.append(k)
values.append(float(k.split("_")[-1]))
# 做距离加权
values = sorted(values)
index = values.index(value)
# 取其前一个和后一个做插值
last = max(0,index-1)
next = min(index+1,len(values)-1)
if index==last or index==next:
return self.__fixed_value()
else:
d1=abs(values[last] - value)
d2=abs(values[next] - value)
v1 = data_dict[tmp_key+str(values[last])]
v2 = data_dict[tmp_key+str(values[next])]
# 距离加权 y=e^(-x)
return (np.log(d1)*v1+np.log(d2)*v2)/(np.log(d1)+np.log(d2))
def predict_proba(self,X:np.array):
return self.__predict(X)
def predict(self,X:np.array):
return np.argmax(self.__predict(X),-1)
def accuracy(self,y_true:np.array,y_pred:np.array)->float:
return round(np.sum(y_pred==y_true)/len(y_pred),5)
if __name__=="__main__":
dataPath = "../../dataset/titannic/train.csv"
X, y = loadData(dataPath)
# 划分训练集与测试集
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=40)
start = time.time()
clf = NaiveBayesClassifier()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f"%(time.time()-start,clf.accuracy(y_test,y_pred)))
# cost time:0.089734(s) acc:0.771
# 使用sklearn 的GaussianNB
start = time.time()
clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f" % (time.time() - start, accuracy_score(y_test, y_pred)))
# cost time:0.001023(s) acc:0.810
# 使用sklearn 的DecisionTreeClassifier
start = time.time()
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f" % (time.time() - start, accuracy_score(y_test, y_pred)))
# cost time:0.008215(s) acc:0.816
# 使用sklearn 的RandomForestClassifier
start = time.time()
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("cost time:%.6f(s) acc:%.3f" % (time.time() - start, accuracy_score(y_test, y_pred)))
# cost time:0.018951(s) acc:0.782
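The __fixed_value / __weighted_average fallbacks above exist to handle feature values never seen during training. A more standard alternative is Laplace (additive) smoothing. Below is a hedged sketch of how the class-conditional estimates built in fit() could be smoothed instead; this is our variation, not code from the repository:

```python
import numpy as np

def laplace_cond_prob(data: np.ndarray, val, n_distinct: int, alpha: float = 1.0) -> float:
    """P(feature == val | class) with add-alpha smoothing.

    data       : the feature column restricted to one class
    val        : the feature value whose probability we want
    n_distinct : number of distinct values this feature takes over the whole training set
    alpha      : smoothing strength (1.0 gives classic Laplace smoothing)
    """
    return (np.sum(data == val) + alpha) / (len(data) + alpha * n_distinct)

# Example: the feature takes the values {0.0, 1.0, 2.0} across the training set,
# but 2.0 never occurs within this class; it still gets a small nonzero probability.
col = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
print(laplace_cond_prob(col, 1.0, n_distinct=3))  # (3+1)/(5+3) = 0.5
print(laplace_cond_prob(col, 2.0, n_distinct=3))  # (0+1)/(5+3) = 0.125
```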