For the complete set of machine learning algorithms, see fenghaootong-github
Handwritten digit recognition with naive Bayes
Dataset description
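The data files train.csv and test.csv contain grayscale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels high and 28 pixels wide, 784 pixels in total. Each pixel has a single value that indicates its brightness, with higher numbers meaning darker. The value is an integer from 0 to 255.
The training set has 785 columns. The first column, "label", is the true digit drawn by the user; the remaining columns hold the pixel values. Each row is one digit image.
Each pixel column in the training set has a name like pixelx, where x is an integer from 0 to 783. To locate this pixel on the image, decompose x as x = i * 28 + j, where i and j are integers from 0 to 27. Then pixelx sits at row i and column j of a 28 x 28 matrix (zero-indexed).
The test data set (test.csv) has the same layout as the training set, except that it does not contain the "label" column.
With the "pixel" prefix omitted, the pixels are laid out in the image as follows: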
000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
| | | | ... | |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
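The decomposition x = i * 28 + j can be sketched in a few lines of Python. The helper name `pixel_position` and the constant `WIDTH` are illustrative only; they are not part of the dataset or of the code in this post.

```python
import numpy as np

WIDTH = 28  # images are 28 x 28 pixels

def pixel_position(x):
    """Return the zero-indexed (row, column) of the column named pixel<x>."""
    if not 0 <= x < WIDTH * WIDTH:
        raise ValueError("pixel index must be between 0 and 783")
    return divmod(x, WIDTH)  # i = x // 28, j = x % 28

# A flat row of 784 pixel values reshapes into the 28 x 28 image row by row.
flat = np.arange(WIDTH * WIDTH)   # stand-in for one row of pixel values
image = flat.reshape(WIDTH, WIDTH)

i, j = pixel_position(755)
print(i, j, image[i, j])
```

Here pixel755 lands on row 26, column 27, which matches the last entry of the second-to-last row in the grid above.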
The code is as follows
# pandas loads the CSV files; the unused sklearn.neighbors import has been dropped
import pandas as pd
Load the data
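df = pd.read_csv('../DATA/train.csv')
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
labels = df[['label']].to_numpy()  # label column as an (n, 1) array
dataset = df.drop('label', axis=1).to_numpy()  # pixel columns only, label dropped
# scale by the pixel count (784), as in the original; dividing by 255 is the more common choice
dataset = dataset / (28.0*28.0)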
Split the data into training and validation sets
train_len = int(len(labels.ravel()) * 0.75)
train_dataset = dataset[:train_len]
train_labels = labels[:train_len]
valid_dataset = dataset[train_len:]
valid_labels = labels[train_len:]
len(valid_labels.ravel())
1050
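The slice above takes the first 75% of the rows for training, which works as long as the file is not ordered by label. As an alternative sketch (not what the code in this post does), scikit-learn's `train_test_split` can shuffle the rows and keep the class proportions equal in both parts. The arrays `X` and `y` below are synthetic stand-ins for `dataset` and `labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 samples, 784 features, 10 classes with 4 samples each.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(40, 784)) / 255.0
y = np.repeat(np.arange(10), 4)

# test_size=0.25 mirrors the 75/25 split; stratify=y keeps class proportions equal.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_valid.shape)
```

Fixing `random_state` makes the split reproducible between runs.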
Import the naive Bayes classifiers
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
Train the model
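# named MNB because the model is a multinomial naive Bayes, not a Gaussian one
MNB = MultinomialNB(alpha=0.1)
MNB.fit(train_dataset, train_labels.ravel())
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
Make predictions
predictions = [int(a) for a in MNB.predict(valid_dataset)]
correct = 0  # counter for matching predictions; avoids shadowing the built-in sum()
for a, y in zip(predictions, valid_labels.ravel()):
    if a == y:
        correct += 1
# the accuracy printed here is measured on the validation split
print("%s of %s values correct. \ntest accuracy: %f" % (correct, len(valid_labels.ravel()), correct / len(valid_labels.ravel())))
877 of 1050 values correct.
test accuracy: 0.835238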
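The counting loop above computes plain accuracy: the fraction of predictions that equal the true label. A vectorized sketch of the same calculation follows, using small made-up arrays rather than the real predictions from this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predictions and labels, for illustration only.
predictions = np.array([3, 1, 4, 1, 5, 9, 2, 6])
true_labels = np.array([3, 1, 4, 7, 5, 9, 2, 0])

n_correct = int((predictions == true_labels).sum())  # element-wise comparison
accuracy = n_correct / len(true_labels)

# accuracy_score from scikit-learn returns the same fraction.
sk_accuracy = accuracy_score(true_labels, predictions)
print("%d of %d values correct, accuracy: %f" % (n_correct, len(true_labels), accuracy))
```

With these arrays, 6 of the 8 predictions match, so both calculations give 0.75.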
Define a function that evaluates several models
# each entry pairs a classifier class with its constructor keyword arguments
classifiers = ((MultinomialNB, dict(alpha=0.1)),
               (GaussianNB, {}),
               (BernoulliNB, {}))
def bayes(args):
    for classifier, kwargs in args:
        print(classifier.__name__)
        correct = 0  # avoids shadowing the built-in sum()
        model = classifier(**kwargs)
        model.fit(train_dataset, train_labels.ravel())
        predictions = [int(a) for a in model.predict(valid_dataset)]
        for a, y in zip(predictions, valid_labels.ravel()):
            if a == y:
                correct += 1
        # the accuracy printed here is measured on the validation split
        print("%s \n %s of %s values correct. \ntest accuracy: %f" % (classifier, correct, len(valid_labels.ravel()), correct / len(valid_labels.ravel())))
bayes(classifiers)
MultinomialNB
<class 'sklearn.naive_bayes.MultinomialNB'>
877 of 1050 values correct.
test accuracy: 0.835238
GaussianNB
<class 'sklearn.naive_bayes.GaussianNB'>
619 of 1050 values correct.
test accuracy: 0.589524
BernoulliNB
<class 'sklearn.naive_bayes.BernoulliNB'>
874 of 1050 values correct.
test accuracy: 0.832381
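One detail that may help when reading the BernoulliNB result: with its default `binarize=0.0`, scikit-learn's BernoulliNB maps every feature value greater than 0 to 1 and everything else to 0 before fitting. The sketch below mimics that thresholding with NumPy on made-up pixel values, to show that the resulting binary features do not depend on whether pixels are divided by 784 or by 255. The helper `binarize_like_bernoulli` is hypothetical and only illustrates the thresholding; it is not the library's internal code.

```python
import numpy as np

# Made-up raw pixel values in the 0-255 range, for illustration only.
raw = np.array([[0, 0, 12, 200, 255, 0],
                [0, 90, 255, 255, 3, 0]], dtype=float)

def binarize_like_bernoulli(features, threshold=0.0):
    """Map values strictly above the threshold to 1.0 and the rest to 0.0."""
    return (features > threshold).astype(float)

scaled_784 = raw / (28.0 * 28.0)  # the scaling used in this post
scaled_255 = raw / 255.0          # the more common 0-1 scaling

same = np.array_equal(binarize_like_bernoulli(scaled_784),
                      binarize_like_bernoulli(scaled_255))
print(same)
```

Under this assumption, BernoulliNB effectively sees each image as a mask of "ink" versus "no ink", regardless of the constant used to scale the pixels.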