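This post is my own original work and serves only as a personal study record.
Data: suppose the following is course data with three attributes: price A, class hours B, and sales C.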
Price A | Class hours B | Sales C |
low | many | high |
high | medium | high |
low | few | high |
low | medium | low |
medium | medium | medium |
high | many | high |
low | few | medium |
Now the school launches a new course with price A = high and class hours B = many, and we need to predict its sales.
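The task is to predict an outcome from observed attributes, which is exactly what naive Bayes does. Most examples online simply call a library API to make the prediction, but I think it is better to implement naive Bayes yourself.
Bayes' theorem: P(B|A)=P(A|B)P(B)/P(A) (here A and B stand for any two events, not the price and class-hours columns).
For this post the formula becomes P(C|AB)=P(AB|C)P(C)/P(AB)=P(A|C)P(B|C)P(C)/P(AB).
The method is to compute this probability for C = low, medium and high sales in turn, compare the three values, and report the class with the largest probability as the prediction. Because P(AB) is the same for all three classes, only the numerator P(A|C)P(B|C)P(C) needs to be compared.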
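As a worked example, the three numerators can be computed by hand from the table for A = high and B = many. This is only a sketch that assumes the probabilities are estimated from raw counts with no smoothing; the shared denominator P(AB) is left out because it does not affect the comparison.

```latex
\begin{align*}
P(A{=}\text{high}\mid C{=}\text{high})\,P(B{=}\text{many}\mid C{=}\text{high})\,P(C{=}\text{high})
  &= \tfrac{2}{4}\cdot\tfrac{2}{4}\cdot\tfrac{4}{7} = \tfrac{1}{7} \approx 0.143 \\
P(A{=}\text{high}\mid C{=}\text{medium})\,P(B{=}\text{many}\mid C{=}\text{medium})\,P(C{=}\text{medium})
  &= \tfrac{0}{2}\cdot\tfrac{0}{2}\cdot\tfrac{2}{7} = 0 \\
P(A{=}\text{high}\mid C{=}\text{low})\,P(B{=}\text{many}\mid C{=}\text{low})\,P(C{=}\text{low})
  &= \tfrac{0}{1}\cdot\tfrac{0}{1}\cdot\tfrac{1}{7} = 0
\end{align*}
```

The largest value belongs to C = high, so high sales is the predicted class.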
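Incidentally, the word "naive" means that A and B are treated as independent of each other (more precisely, independent once the sales class C is given). In reality price and class hours are related to some degree, but naive Bayes treats them as independent in order to compute the probability of each sales class.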
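To see what the independence assumption does on this data, the sketch below compares the joint estimate P(A,B|C) with the product P(A|C)P(B|C), both taken from raw counts. The English labels, the `ROWS` list and the function name `joint_and_product` are my own choices for illustration, not part of the original code.

```python
from fractions import Fraction

# Rows of the course table as (price, hours, sales); labels are assumed names.
ROWS = [
    ("low", "many", "high"),
    ("high", "medium", "high"),
    ("low", "few", "high"),
    ("low", "medium", "low"),
    ("medium", "medium", "medium"),
    ("high", "many", "high"),
    ("low", "few", "medium"),
]


def joint_and_product(price, hours, sales, rows=ROWS):
    """Return (P(A,B|C), P(A|C) * P(B|C)) estimated from raw counts."""
    in_class = [r for r in rows if r[2] == sales]
    n = len(in_class)
    if n == 0:
        raise ValueError("no rows with sales == %r" % (sales,))
    both = sum(1 for r in in_class if r[0] == price and r[1] == hours)
    p_a = Fraction(sum(1 for r in in_class if r[0] == price), n)
    p_b = Fraction(sum(1 for r in in_class if r[1] == hours), n)
    return Fraction(both, n), p_a * p_b


if __name__ == "__main__":
    for price, hours in [("high", "many"), ("high", "medium")]:
        joint, product = joint_and_product(price, hours, "high")
        print(price, hours, "joint:", joint, "product:", product)
```

Within the high-sales class the two values agree for (high, many) but differ for (high, medium), so the factorization is an approximation rather than a property of the data. With only seven rows this says little statistically; it just shows what is being assumed.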
Here is my code:
# coding=utf-8
from __future__ import division, print_function

# Map each category label to an integer code 0, 1, 2.
PRICE_CODES = {"low": 0, "medium": 1, "high": 2}
TIME_CODES = {"few": 0, "medium": 1, "many": 2}
SALE_CODES = {"low": 0, "medium": 1, "high": 2}
SALE_LABELS = ["low", "medium", "high"]


def set_data(price, time, sale):
    """Encode the three label lists as lists of integers."""
    price_number = [PRICE_CODES[p] for p in price]
    time_number = [TIME_CODES[t] for t in time]
    sale_number = [SALE_CODES[s] for s in sale]
    return price_number, time_number, sale_number


def naive_bs(price_number, time_number, sale_number, expected_price, expected_time):
    """Print the naive Bayes score of each sales class and the prediction."""
    total = len(sale_number)
    p_price = []     # P(A = expected_price | C = c) for c = 0, 1, 2
    p_time = []      # P(B = expected_time | C = c) for c = 0, 1, 2
    final_list = []  # P(A|C) * P(B|C) * P(C) * 1000 for c = 0, 1, 2
    for c in range(3):
        # Indices of the rows whose sales class is c.
        rows = [i for i in range(total) if sale_number[i] == c]
        if not rows:
            # Class never observed: P(C = c) = 0, so its score is 0.
            p_price.append(0.0)
            p_time.append(0.0)
            final_list.append(0.0)
            continue
        p_a = sum(1 for i in rows if price_number[i] == expected_price) / len(rows)
        p_b = sum(1 for i in rows if time_number[i] == expected_time) / len(rows)
        p_price.append(p_a)
        p_time.append(p_b)
        # P(AB) is the same for every class, so it is left out.
        # The factor 1000 only makes the printed numbers easier to read.
        final_list.append(p_a * p_b * len(rows) / total * 1000)
    print("p(a|c):%s ,%s,%s" % tuple(p_price))
    print("p(b|c): %s,%s,%s" % tuple(p_time))
    print(final_list)
    final_index = final_list.index(max(final_list))
    print("Predicted sales: %s" % SALE_LABELS[final_index])


if __name__ == "__main__":
    price = ["low", "high", "low", "low", "medium", "high", "low"]
    time = ["many", "medium", "few", "medium", "medium", "many", "few"]
    sale = ["high", "high", "high", "low", "medium", "high", "medium"]
    expected_price = "high"  # the new course has a high price
    expected_time = "many"   # the new course has many class hours
    price_number, time_number, sale_number = set_data(price, time, sale)
    print(price_number, time_number, sale_number)
    naive_bs(price_number, time_number, sale_number,
             PRICE_CODES[expected_price], TIME_CODES[expected_time])
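The code encodes the values of each of the three features as 0, 1, 2. It assumes that the price, class-hours and sales lists all have the same length. Real data will often not be that clean, so it should be preprocessed first, mainly by handling missing values and outliers.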
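One possible preprocessing step, sketched under my own assumptions: drop every row that has a missing value or a label outside the known categories. The label sets and the function name `clean_rows` are hypothetical, and dropping rows is only one policy among several (imputing a value is another).

```python
# Allowed labels per column; anything else is treated as an outlier here.
VALID_PRICE = {"low", "medium", "high"}
VALID_HOURS = {"few", "medium", "many"}
VALID_SALES = {"low", "medium", "high"}


def clean_rows(price, hours, sales):
    """Keep only complete rows whose three labels are all recognised.

    zip() stops at the shortest list, so surplus trailing entries in the
    longer lists are dropped as well.
    """
    cleaned = []
    for p, h, s in zip(price, hours, sales):
        if p in VALID_PRICE and h in VALID_HOURS and s in VALID_SALES:
            cleaned.append((p, h, s))
    return cleaned


if __name__ == "__main__":
    price = ["low", None, "high", "cheap", "low"]
    hours = ["many", "few", "", "few", "few"]
    sales = ["high", "low", "high", "low"]
    print(clean_rows(price, hours, sales))
```

Note that truncating to the shortest list only helps when the extra entries sit at the end. If a value is missing in the middle of one list, the rows are misaligned and have to be repaired at the source.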
Below is the output from running it in Eclipse:
[0, 2, 0, 0, 1, 2, 0] [2, 1, 0, 1, 1, 2, 0] [2, 2, 2, 0, 1, 2, 1]
p(a|c):0.0 ,0.0,0.5
p(b|c): 0.0,0.0,0.5
[0.0, 0.0, 142.85714285714286]
Predicted sales: high
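The three numbers in the printed list are the numerators P(A|C)P(B|C)P(C) multiplied by 1000, not probabilities. A minimal sketch of turning them into posterior probabilities under the naive Bayes model is to divide each score by their sum, which plays the role of P(AB). The function name `normalize` is my own and is not part of the code above.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1.

    Returns None when every score is zero, because the posterior is
    undefined in that case.
    """
    total = sum(scores)
    if total == 0:
        return None
    return [s / total for s in scores]


if __name__ == "__main__":
    print(normalize([0.0, 0.0, 142.85714285714286]))
```

For the run above this gives a posterior of 1 for high sales and 0 for the other two classes, which mainly reflects how few rows the table has.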
This post is only a personal study record and may well contain many shortcomings.