## Basic formulas
Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B)
Assuming B1, B2, …, Bn are conditionally independent given A (the "naive" assumption), the joint likelihood factorizes: P(B1, B2, …, Bn | A) = P(B1|A) × P(B2|A) × … × P(Bn|A)
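As a quick numeric sanity check of Bayes' theorem (the probabilities below are made up for illustration and are unrelated to the dataset that follows):

```python
# Hypothetical values, for illustration only
p_b_given_a = 0.8  # P(B|A): likelihood
p_a = 0.3          # P(A): prior
p_b = 0.5          # P(B): evidence

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.48
```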
## Data (fictional)
```
A1 A2 A3 A4 A5 B
1 1 1 1 3 no
1 1 1 2 2 soft
1 1 2 1 3 no
1 1 2 2 1 hard
1 2 1 1 2 no
1 2 1 2 3 soft
1 2 2 1 1 no
1 2 2 2 2 hard
2 1 1 1 3 no
2 1 1 2 3 soft
2 1 2 1 1 no
2 1 2 2 1 hard
2 2 1 1 2 no
2 2 1 2 3 soft
2 2 2 1 2 soft
2 2 2 2 2 hard
3 1 1 1 1 no
3 1 1 2 2 soft
3 1 2 1 1 no
3 1 2 2 1 hard
3 2 1 1 3 soft
3 2 1 2 1 soft
3 2 2 1 2 no
3 2 2 2 3 no
```
Five features (A1–A5) and one label (B); each row is one training sample, with the label in the last column.
## Algorithm steps
1. Compute probabilities from the training set:
(1) The class priors:
P(B="hard"), P(B="soft"), P(B="no")
(2) The conditional probabilities of each feature value given each class:
P(A1="1"|B="hard"), P(A1="2"|B="hard"), P(A1="3"|B="hard");
P(A2="1"|B="hard"), P(A2="2"|B="hard"), ...
P(A1="1"|B="soft"), P(A1="2"|B="soft"), P(A1="3"|B="soft");
P(A2="1"|B="soft"), P(A2="2"|B="soft"), ...
P(A1="1"|B="no"), P(A1="2"|B="no"), P(A1="3"|B="no");
P(A2="1"|B="no"), P(A2="2"|B="no"), ...
2. Use Bayes' theorem to score each class for a test sample test_A:
Compute P(B="hard"|test_A), P(B="soft"|test_A), P(B="no"|test_A). Since the denominator P(test_A) is the same for every class, it can be dropped, so each score is computed as P(B) × ∏ P(Ai = test_Ai | B).
The class with the largest score is the prediction of the naive Bayes classifier.
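Step 1(1) on the fictional dataset above gives P(B="no") = 11/24, P(B="soft") = 8/24, and P(B="hard") = 5/24. A minimal sketch (the label list is copied from the table in order):

```python
from collections import Counter

# Labels of the 24 training rows, in table order
labels = ['no', 'soft', 'no', 'hard', 'no', 'soft', 'no', 'hard',
          'no', 'soft', 'no', 'hard', 'no', 'soft', 'soft', 'hard',
          'no', 'soft', 'no', 'hard', 'soft', 'soft', 'no', 'no']

# Class priors P(B = label)
priors = {b: c / len(labels) for b, c in Counter(labels).items()}
# no: 11/24 ≈ 0.458, soft: 8/24 ≈ 0.333, hard: 5/24 ≈ 0.208
```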
## Implementation
```python
def train(dataSet, labels):
    uniqueLabels = set(labels)
    res = {}
    for label in uniqueLabels:
        res[label] = []
        # P(B = label): class prior
        res[label].append(labels.count(label) / float(len(labels)))
        for i in range(len(dataSet[0]) - 1):
            # values of feature Ai among samples of this class
            tempCols = [row[i] for row in dataSet if row[-1] == label]
            uniqueValues = set(tempCols)
            probs = {}  # renamed from `dict` to avoid shadowing the built-in
            for value in uniqueValues:
                count = tempCols.count(value)
                # P(Ai = value | B = label)
                probs[value] = count / float(labels.count(label))
            res[label].append(probs)
    return res
```
```python
def test(testVect, probMat):
    hard = probMat['hard']
    soft = probMat['soft']
    no = probMat['no']
    # start each score from the class prior
    phard = hard[0]
    psoft = soft[0]
    pno = no[0]
    res = {}
    for i in range(len(testVect)):
        # multiply in P(Ai = testVect[i] | B); a value never seen
        # with a class zeroes that class's score
        if testVect[i] in hard[i + 1]:
            phard *= hard[i + 1][testVect[i]]
        else:
            phard = 0
        if testVect[i] in soft[i + 1]:
            psoft *= soft[i + 1][testVect[i]]
        else:
            psoft = 0
        if testVect[i] in no[i + 1]:
            pno *= no[i + 1][testVect[i]]
        else:
            pno = 0
    res['hard'] = phard
    res['soft'] = psoft
    res['no'] = pno
    print(phard, psoft, pno)
    return max(res, key=res.get)
```
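A practical caveat (not addressed in the code above): multiplying many small conditional probabilities can underflow to 0.0 when there are many features. Because log is monotonic, comparing sums of log-probabilities gives the same argmax; a minimal sketch, assuming every needed P(Ai|B) is present and non-zero (`log_score` is a hypothetical helper, not part of the original code):

```python
import math

def log_score(testVect, classProbs):
    # classProbs is one entry of probMat: [prior, {A1: prob}, {A2: prob}, ...]
    score = math.log(classProbs[0])
    for i, value in enumerate(testVect):
        score += math.log(classProbs[i + 1][value])  # add log P(Ai=value | B)
    return score

# Tiny demo with made-up probabilities
demo = [0.5, {'1': 0.4, '2': 0.6}]
print(log_score(['2'], demo))  # log(0.5) + log(0.6)
```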
### Loading the data
```python
def loadDataSet(filename):
    returnMat = []
    labels = []
    with open(filename) as fr:
        for line in fr.readlines():
            listFromLine = line.strip().split(' ')
            labels.append(listFromLine[-1])   # last column is the label
            returnMat.append(listFromLine)    # full row, label included
    return returnMat, labels
```
### Computing probabilities from the training set
Here `res`, returned by `train`, is a dictionary holding every probability described in step 1 of the algorithm above. Its structure (shown with the probability each entry stands for):

```
{'hard': [P(B="hard"), {'1': P(A1="1"|B="hard"), '2': P(A1="2"|B="hard"), '3': P(A1="3"|B="hard")}, {'1': P(A2="1"|B="hard"), '2': P(A2="2"|B="hard")}, {'1': P(A3="1"|B="hard"), '2': P(A3="2"|B="hard")}, {'1': P(A4="1"|B="hard"), '2': P(A4="2"|B="hard")}, {'1': P(A5="1"|B="hard"), '2': P(A5="2"|B="hard"), '3': P(A5="3"|B="hard")}],
 'soft': [P(B="soft"), {'1': P(A1="1"|B="soft"), '2': P(A1="2"|B="soft"), '3': P(A1="3"|B="soft")}, {'1': P(A2="1"|B="soft"), '2': P(A2="2"|B="soft")}, {'1': P(A3="1"|B="soft"), '2': P(A3="2"|B="soft")}, {'1': P(A4="1"|B="soft"), '2': P(A4="2"|B="soft")}, {'1': P(A5="1"|B="soft"), '2': P(A5="2"|B="soft"), '3': P(A5="3"|B="soft")}],
 'no': [P(B="no"), {'1': P(A1="1"|B="no"), '2': P(A1="2"|B="no"), '3': P(A1="3"|B="no")}, {'1': P(A2="1"|B="no"), '2': P(A2="2"|B="no")}, {'1': P(A3="1"|B="no"), '2': P(A3="2"|B="no")}, {'1': P(A4="1"|B="no"), '2': P(A4="2"|B="no")}, {'1': P(A5="1"|B="no"), '2': P(A5="2"|B="no"), '3': P(A5="3"|B="no")}]}
```

If a probability is 0, the corresponding key–value pair is simply not stored in the dictionary.
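Omitting zero probabilities means a single feature value never seen with a class zeroes out that class's entire score in `test`. Laplace (add-one) smoothing is the standard remedy; a hedged sketch of how the inner loop of `train` could compute smoothed estimates (`smoothed_probs` is a hypothetical helper; variable names mirror `train`):

```python
def smoothed_probs(tempCols, uniqueValues, classCount):
    # P(Ai = value | B) with add-one (Laplace) smoothing:
    # (count + 1) / (classCount + number of distinct values of Ai)
    numValues = len(uniqueValues)
    return {value: (tempCols.count(value) + 1) / float(classCount + numValues)
            for value in uniqueValues}

# '3' never occurs in this class, yet still gets probability 1/6
print(smoothed_probs(['1', '1', '2'], {'1', '2', '3'}, 3))
```

For unseen values to actually benefit, `uniqueValues` should be collected over the whole training set rather than only the rows of one class.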
### Computing class probabilities for a test sample
## Test results
```python
dataSet, labels = loadDataSet("dataset.txt")
probMat = train(dataSet, labels)
res = test(['3', '1', '2', '2', '1'], probMat)
print(res)
```