Reference links
Perceptron theory and derivation: https://blog.csdn.net/ACM_hades/article/details/89496175
Data: https://github.com/WenDesi/lihang_book_algorithm/blob/master/data
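Code
- Dataset: we run the experiment on MNIST, a collection of 28×28 images of handwritten digits (0-9). MNIST has 10 classes; to turn it into a binary classification problem, labels equal to 0 stay 0 and labels greater than 0 become 1.
- Feature selection: many features are possible, including:
  - hand-crafted features
  - the whole image as a feature vector
  - HOG features
- We use HOG features (324 dimensions) and the whole image as a feature vector (784 = 28×28 dimensions).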
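The two preprocessing choices above can be sketched in a few lines. This is only an illustration: `binarize_labels` and `hog_length` are hypothetical helper names, and the HOG parameters (28-pixel window, 14-pixel blocks, 7-pixel stride, 7-pixel cells, 9 orientation bins) are an assumption, since the contents of `hog.xml` are not shown here. They are one configuration that happens to give a 324-dimensional descriptor.

```python
def binarize_labels(labels):
    """Map digit labels 0-9 to two classes: 0 stays 0, every other digit becomes 1."""
    return [0 if label == 0 else 1 for label in labels]


def hog_length(win, block, stride, cell, nbins):
    """Length of a HOG descriptor for a square window (all sizes in pixels)."""
    blocks_per_side = (win - block) // stride + 1   # block positions along one axis
    cells_per_block = (block // cell) ** 2          # cells inside one block
    return blocks_per_side ** 2 * cells_per_block * nbins


print(binarize_labels([0, 3, 7, 0, 9]))
print(28 * 28)  # length of the raw-pixel feature vector
# Assumed parameters, not read from hog.xml
print(hog_length(win=28, block=14, stride=7, cell=7, nbins=9))
```

With these assumed parameters there are 3×3 block positions, 4 cells per block and 9 bins per cell, which is where 324 comes from.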
Code:
import pandas as pd
import numpy as np
import cv2  # needed by get_hog_features
import random
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Extract HOG features from the images with OpenCV
def get_hog_features(trainset):
    features = []
    hog = cv2.HOGDescriptor('../hog.xml')  # HOG parameters are loaded from this file
    for img in trainset:
        img = np.reshape(img, (28, 28))    # flat 784-vector back to a 28x28 image
        cv_img = img.astype(np.uint8)
        hog_feature = hog.compute(cv_img)
        features.append(hog_feature)
    features = np.array(features)
    features = np.reshape(features, (-1, 324))  # one 324-dimensional row per image
    return features

# Perceptron model
class Perceptron(object):
    def __init__(self):
        self.learning_step = 0.00001
        self.max_iteration = 5000

    def model_function(self, x):
        wx = x.dot(self.w)
        return np.sign(wx)

    def train(self, features, labels):
        # Fold the bias b into w as one extra trailing weight
        self.w = np.zeros(len(features[0]) + 1, dtype=np.float32)
        correct_count = 0
        # Stop once more than max_iteration sampled points in total
        # have been classified correctly
        while correct_count <= self.max_iteration:
            # Pick one random sample for a stochastic gradient descent step
            index = random.randint(0, len(labels) - 1)
            x = features[index]
            x = np.append(x, 1.0)  # the constant 1.0 is the coefficient of b
            y = labels[index]
            pred = self.model_function(x)
            if y * pred > 0:  # sample classified correctly
                correct_count += 1
                continue
            # Misclassified: update w (and b)
            self.w += self.learning_step * y * x

    def predict(self, features):
        labels = []
        for feature in features:
            x = np.append(feature, 1.0)
            labels.append(self.model_function(x))
        return labels

if __name__ == '__main__':
    print('Start read data')
    S = time.time()
    raw_data = pd.read_csv('../data/train_binary.csv')  # read the data
    data = raw_data.values  # underlying NumPy array
    print("data shape:", data.shape)
    imgs = data[0:, 1:]   # pixel columns
    labels = data[:, 0]   # first column is the 0/1 label
    # imgs = get_hog_features(imgs)  # uncomment to use HOG features instead of raw pixels
    print("imgs shape:", imgs.shape)
    print("labels shape:", labels.shape)
    # Use 2/3 of the data for training and 1/3 for testing
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    train_labels = 2 * train_labels - 1  # map 0/1 to -1/+1
    test_labels = 2 * test_labels - 1
    print("train data count :%d" % len(train_labels))
    print("test data count :%d" % len(test_labels))
    print('read data cost ', time.time() - S, ' second')
    print('Start training')
    S = time.time()
    p = Perceptron()
    p.train(train_features, train_labels)
    print('training cost ', time.time() - S, ' second')
    print('Start predicting')
    S = time.time()
    test_predict = p.predict(test_features)
    print('predicting cost ', time.time() - S, ' second')
    score = accuracy_score(test_labels, test_predict)
    print("The accruacy socre is ", score)
Output:
With HOG features:
Start read data
data shape: (42000, 785)
imgs shape: (42000, 324)
labels shape: (42000,)
train data count :28140
test data count :13860
read data cost 5.35866117477417 second
Start training
training cost 0.07878541946411133 second
Start predicting
predicting cost 0.12164664268493652 second
The accruacy socre is 0.9935786435786436
With raw image pixels as features:
Start read data
data shape: (42000, 785)
imgs shape: (42000, 784)
labels shape: (42000,)
train data count :28140
test data count :13860
read data cost 3.7569241523742676 second
Start training
training cost 0.08876228332519531 second
Start predicting
predicting cost 0.12666058540344238 second
The accruacy socre is 0.9242424242424242
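To see the update rule from `Perceptron.train` in isolation, it can help to run it on a tiny, linearly separable toy set. The sketch below is not the code used for the results above: it sweeps over the samples in order instead of sampling at random, stops after a full pass without mistakes, and uses a made-up four-point dataset and hypothetical function names (`train_perceptron`, `predict`).

```python
import numpy as np


def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron with the bias folded into w; labels y must be in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a constant 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * xi.dot(w) <= 0:   # misclassified, or exactly on the boundary
                w += lr * yi * xi     # same form as self.w += learning_step * y * x
                mistakes += 1
        if mistakes == 0:             # a full pass with no mistakes: stop
            break
    return w


def predict(X, w):
    """Return +1 or -1 for each row of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb.dot(w) > 0, 1, -1)


# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = train_perceptron(X, y)
print("weights (last entry is the bias):", w)
print("all training points correct:", bool((predict(X, w) == y).all()))
```

On separable data like this the loop is guaranteed to stop after finitely many updates. On the binarized MNIST data that guarantee does not necessarily hold, which is one reason a cap on the number of iterations is needed.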
API notes
- accuracy_score:https://blog.csdn.net/u011630575/article/details/79645814
- train_test_split:https://blog.csdn.net/u011089523/article/details/72810720
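As a quick illustration of the two scikit-learn functions listed above, here is a minimal sketch on made-up data. The arrays are placeholders; only `test_size=0.33` and `random_state=23323` are taken from the script.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data: 100 samples with 2 features each, alternating 0/1 labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Hold out roughly one third of the samples; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=23323)
print(len(X_train), len(X_test))

# accuracy_score is the fraction of predictions that match the true labels
acc = accuracy_score([1, -1, 1, 1], [1, -1, -1, 1])
print(acc)  # 3 of 4 labels match, so 0.75
```

Features and labels are shuffled together, so each row of `X_train` still lines up with the corresponding entry of `y_train`.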