1. Collecting and cleaning the dataset
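Start with an entry-level spam classification dataset such as SpamBase (download: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/). Each record has 58 attributes separated by commas: the first 57 are features and the last one is the spam label.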
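To make the record layout concrete, here is a minimal sketch of parsing a single line, assuming the fields are comma-separated (which is what the loader's `split(',')` expects). The helper name `parse_record` and the sample record are hypothetical, made up for illustration only; they are not part of the dataset or of any library.

```python
def parse_record(line, n_features=57):
    """Split one comma-separated record into (features, label).

    Assumes the first n_features fields are numeric features and the
    last field is the 0/1 spam label.
    """
    fields = line.strip().split(',')
    if len(fields) != n_features + 1:
        raise ValueError('expected %d fields, got %d' % (n_features + 1, len(fields)))
    features = [float(v) for v in fields[:n_features]]
    label = int(fields[-1])
    return features, label


# Hypothetical record (not real data): 57 zero-valued features, label 1.
sample = ','.join(['0'] * 57 + ['1'])
features, label = parse_record(sample)
print(len(features))
print(label)
```

Checking the field count up front turns a malformed line into an explicit error instead of a silently misaligned feature vector.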
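import numpy as np
from sklearn.model_selection import train_test_split

def load_SpamBase(filename):
    # Each line holds 57 comma-separated feature values followed by the 0/1 spam label.
    x = []
    y = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. a trailing newline at end of file
            v = line.split(',')
            y.append(int(v[-1]))
            x.append([float(v[i]) for i in range(57)])
    x = np.array(x)
    y = np.array(y)
    print(x.shape)
    print(y.shape)
    # Hold out 40% of the samples for testing; random_state=0 makes the split reproducible.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
    return x_train, x_test, y_train, y_test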
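
The shapes printed by the loader follow from `test_size=0.4`. The sketch below works out the arithmetic under two assumptions that are not stated in the text above: that the test count is rounded up and the training set takes the remainder (which, to my knowledge, matches recent scikit-learn behaviour), and that SpamBase contains 4601 records (the figure given in the UCI description). The helper `split_sizes` is hypothetical and not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    gets whatever is left over.
    """
    n_test = int(math.ceil(test_size * n_samples))
    n_train = n_samples - n_test
    return n_train, n_test


# 4601 is the assumed number of SpamBase records.
n_train, n_test = split_sizes(4601, 0.4)
print(n_train)
print(n_test)
```

If the shapes printed by `load_SpamBase` differ from these numbers, the file probably contains a different number of records or the installed library rounds differently.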
2. Using naive Bayes