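0. Preface
- Competition description
MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We've curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.
- Practice skills
  - Computer vision fundamentals, including simple neural networks
  - Classification methods such as SVM and K-nearest neighbors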
1. Packages used
import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")
2. Loading the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
- Inspect the data
train.info()
test.info()
- Check for missing values
print(train.isnull().any().describe())
print()
print(test.isnull().any().describe())
There are no missing values: the summary reports a single unique value.
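To make the `describe()` summary easier to read, here is a minimal sketch on a toy DataFrame (the frame `toy` and its columns are made up for illustration). When no column contains a missing value, `isnull().any()` is False everywhere, so the summary shows one unique value; as soon as one column has a gap, it shows two.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({"label": [1, 0, 4], "pixel0": [0, 0, 0], "pixel1": [0, 255, 12]})

has_null = toy.isnull().any()                  # one boolean per column
summary = has_null.describe()                  # count / unique / top / freq
total_missing = int(toy.isnull().sum().sum())  # direct count of missing cells

# Introduce one missing value to see how the summary changes
toy_missing = toy.astype(float)
toy_missing.loc[0, "pixel1"] = np.nan
summary_missing = toy_missing.isnull().any().describe()

print(summary)
print(total_missing)
print(summary_missing)
```

Summing the missing cells directly, as `total_missing` does, is a more explicit alternative when only a yes/no answer is needed.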
- Check the number of rows and columns
print(train.shape)
print(test.shape)
print(train.head())
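The training-set labels are in the first column.
3. Feature preprocessing
- Separate the feature columns from the label column of the training set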
X = train.iloc[:,1:]
y = train.iloc[:,0]
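Position-based slicing relies on the label being the first column. As a small sketch, assuming the Kaggle column names `label`, `pixel0`, `pixel1`, and so on (the toy frame below is illustrative), selecting by name gives the same split and does not depend on column order.

```python
import pandas as pd

# Toy stand-in for train.csv: a label column followed by pixel columns
toy = pd.DataFrame({"label": [3, 7], "pixel0": [0, 10], "pixel1": [255, 0]})

# Position-based split, as in the text
X_pos = toy.iloc[:, 1:]
y_pos = toy.iloc[:, 0]

# Name-based split: same result, independent of column order
X_name = toy.drop(columns="label")
y_name = toy["label"]

print(X_pos.equals(X_name), y_pos.equals(y_name))
```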
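- View the first 10 digits of the training set
plt.figure(figsize = (10,5))
for num in range(0,10):
    plt.subplot(2,5,num+1)
    # Reshape the length-784 vector into a 28*28 matrix
    # (.values replaces as_matrix(), which was removed in pandas 1.0)
    grid_data = X.iloc[num].values.reshape(28,28)
    # Show the image in grayscale
    plt.imshow(grid_data, interpolation = "none", cmap = "Greys")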
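The reshape works because each image is stored row by row, so flat pixel index i lands at row i // 28 and column i % 28. A minimal NumPy sketch with a synthetic vector (not real MNIST data) illustrates the mapping:

```python
import numpy as np

SIDE = 28
flat = np.arange(SIDE * SIDE)      # synthetic "image": each pixel holds its own flat index
grid = flat.reshape(SIDE, SIDE)    # row-major (C order) is NumPy's default

i = 100
row, col = divmod(i, SIDE)         # row = i // 28, col = i % 28
print(row, col, grid[row, col])

# Flattening again restores the original vector
restored = grid.reshape(-1)
```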
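- Feature preprocessing: scale the feature values to the range [0, 1]
# Fit the scaler on the training features only, then reuse it for the test set
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
print(X)
test = scaler.transform(test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 14)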
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
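Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). The sketch below uses made-up pixel values. It also follows one common convention, assumed here for illustration, of learning the minimum and maximum from the training data and reusing them for new data. Columns that are constant, such as always-zero border pixels, stay at 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training pixels: column 0 varies, column 1 is always 0
train_pixels = np.array([[0.0, 0.0], [255.0, 0.0], [102.0, 0.0]])
new_pixels = np.array([[51.0, 0.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_pixels)  # learns per-column min and max
new_scaled = scaler.transform(new_pixels)          # reuses the training min and max

print(train_scaled)
print(new_scaled)
```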
4. Principal component analysis
The training set has 784 feature columns, so its dimensionality needs to be reduced.
- Use principal component analysis to reduce the dimensionality
all_scores = []  # used for plotting
# Generate the list of candidate n_components values
n_components = np.linspace(0.7,0.9,num=20, endpoint=False)
print(n_components)
def get_accuracy_score(n, X_train, X_test, y_train, y_test):
    '''Compute the prediction accuracy when PCA keeps a fraction n of the variance'''
    t0 = time()
    pca = PCA(n_components = n)
    pca.fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    # Use a support vector machine classifier
    clf = svm.SVC()
    clf.fit(X_train_pca, y_train)
    # Compute the accuracy
    accuracy = clf.score(X_test_pca, y_test)
    t1 = time()
    print('n_components:{:.2f} , accuracy:{:.4f} , time elaps:{:.2f}s'.format(n, accuracy, t1-t0))
    return accuracy
for n in n_components:
    score = get_accuracy_score(n,X_train, X_test, y_train, y_test)
    all_scores.append(score)
[0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83
0.84 0.85 0.86 0.87 0.88 0.89]
n_components:0.70 , accuracy:0.9750 , time elaps:26.44s
n_components:0.71 , accuracy:0.9757 , time elaps:26.35s
n_components:0.72 , accuracy:0.9769 , time elaps:27.06s
n_components:0.73 , accuracy:0.9760 , time elaps:27.49s
n_components:0.74 , accuracy:0.9776 , time elaps:27.72s
n_components:0.75 , accuracy:0.9781 , time elaps:28.63s
n_components:0.76 , accuracy:0.9781 , time elaps:29.46s
n_components:0.77 , accuracy:0.9781 , time elaps:30.22s
n_components:0.78 , accuracy:0.9783 , time elaps:31.08s
n_components:0.79 , accuracy:0.9776 , time elaps:33.13s
n_components:0.80 , accuracy:0.9779 , time elaps:35.57s
n_components:0.81 , accuracy:0.9771 , time elaps:36.11s
n_components:0.82 , accuracy:0.9774 , time elaps:36.17s
n_components:0.83 , accuracy:0.9769 , time elaps:36.98s
n_components:0.84 , accuracy:0.9755 , time elaps:38.15s
n_components:0.85 , accuracy:0.9748 , time elaps:39.21s
n_components:0.86 , accuracy:0.9748 , time elaps:40.42s
n_components:0.87 , accuracy:0.9729 , time elaps:42.43s
n_components:0.88 , accuracy:0.9721 , time elaps:44.55s
n_components:0.89 , accuracy:0.9717 , time elaps:47.55s
- Plot accuracy against n_components
plt.plot(n_components, all_scores, '-o')
plt.xlabel('n_components')
plt.ylabel('accuracy')
plt.show()
Accuracy is highest (0.9783) when n_components, the fraction of variance that PCA retains, is 0.78.
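The best setting can also be read off programmatically rather than from the plot. This sketch re-enters the accuracies from the log above; `best_index` and `best_n` are illustrative names.

```python
import numpy as np

n_components = np.linspace(0.7, 0.9, num=20, endpoint=False)
# Accuracies copied from the log above, in the same order as n_components
all_scores = [0.9750, 0.9757, 0.9769, 0.9760, 0.9776, 0.9781, 0.9781,
              0.9781, 0.9783, 0.9776, 0.9779, 0.9771, 0.9774, 0.9769,
              0.9755, 0.9748, 0.9748, 0.9729, 0.9721, 0.9717]

best_index = int(np.argmax(all_scores))             # position of the highest accuracy
best_n = round(float(n_components[best_index]), 2)  # rounding hides floating-point noise
print(best_n, all_scores[best_index])
```

The top few scores differ only in the fourth decimal place, so the ranking among neighbouring settings may not be stable across different random splits.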
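- Use a baseline SVM model
# Find the misclassified samples
pca = PCA(n_components = 0.78)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
clf = svm.SVC()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
# Compare as NumPy arrays so that the boolean mask is purely positional
errors = (y_pred != y_test.values)
y_pred_errors = y_pred[errors]  # predicted labels
y_test_errors = y_test[errors].values  # true labels
X_test_errors = X_test[errors]  # feature values
# Inspect the misclassified samples
print(y_pred_errors[:5])  # predicted labels
print(y_test_errors[:5])  # true labels from the hold-out set
print(X_test_errors[:5])  # feature values from the hold-out set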
[5 0 8 6 9]
[8 9 6 8 7]
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
- Data visualization: view the misclassified digits
n = 0
nrows = 2
ncols = 5
fig, ax = plt.subplots(nrows,ncols,figsize=(10,6))
for row in range(nrows):
    for col in range(ncols):
        ax[row,col].imshow((X_test_errors[n]).reshape((28,28)), cmap = "Greys")
        ax[row,col].set_title("Predict:{}\nTrue: {}".format(y_pred_errors[n],y_test_errors[n]))
        n += 1
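Beyond looking at individual images, counting (true, predicted) pairs shows which digits are confused most often. This is a minimal sketch using only the five misclassified samples printed above; with the full `y_test_errors` and `y_pred_errors` arrays the same code would cover every error.

```python
from collections import Counter

# The five misclassified samples printed above
y_pred_errors = [5, 0, 8, 6, 9]
y_test_errors = [8, 9, 6, 8, 7]

# Count each (true label, predicted label) pair
confusions = Counter(zip(y_test_errors, y_pred_errors))
# Count how often each true digit was misread
by_true_digit = Counter(y_test_errors)

for (true, pred), count in confusions.most_common():
    print("true {} -> predicted {}: {}".format(true, pred, count))
```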
5. Modeling
pca = PCA(n_components=0.78)  # accuracy was highest at n_components = 0.78
pca.fit(X)
print(pca.n_components_)
# Apply the PCA transform to the training set and the test set
X = pca.transform(X)
test = pca.transform(test)
39
PCA reduces both the training set and the test set to 39 principal components.
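When `n_components` is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction, which is how 0.78 turns into 39 components here. The sketch below illustrates the rule on synthetic data (random numbers, not MNIST), so the resulting count is different.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with decreasing spread
data = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

pca = PCA(n_components=0.78).fit(data)

# Recompute the component count by hand from a full PCA
cumulative = np.cumsum(PCA().fit(data).explained_variance_ratio_)
k_by_hand = int(np.argmax(cumulative > 0.78)) + 1  # first position above the threshold

print(pca.n_components_, k_by_hand)
print(cumulative[:k_by_hand])
```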
- Hyperparameter tuning
# Predict with a support vector machine, tuning its parameters with a grid search
clf_svc = GridSearchCV(estimator=svm.SVC(), param_grid={ 'C': [1, 2, 4, 5], 'kernel': [ 'linear', 'rbf', 'sigmoid' ] }, cv=5, verbose=2 )
clf_svc.fit(X, y)  # fit the model
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] C=1, kernel=linear ..............................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ............................... C=1, kernel=linear, total= 27.5s
[CV] C=1, kernel=linear ..............................................
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 27.4s remaining: 0.0s
[CV] ............................... C=1, kernel=linear, total= 27.7s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total= 25.9s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total= 26.8s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total= 27.5s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total= 24.5s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total= 23.8s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total= 23.7s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total= 23.6s
[CV] C=1, kernel=rbf .................................................
[CV] .................................. C=1, kernel=rbf, total= 23.5s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total= 29.4s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total= 29.6s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total= 30.4s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total= 29.5s
[CV] C=1, kernel=sigmoid .............................................
[CV] .............................. C=1, kernel=sigmoid, total= 29.4s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total= 36.7s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total= 36.6s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total= 35.7s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total= 37.1s
[CV] C=2, kernel=linear ..............................................
[CV] ............................... C=2, kernel=linear, total= 37.6s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total= 23.1s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total= 23.5s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total= 23.1s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total= 22.5s
[CV] C=2, kernel=rbf .................................................
[CV] .................................. C=2, kernel=rbf, total= 22.7s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total= 27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total= 27.0s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total= 27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total= 27.3s
[CV] C=2, kernel=sigmoid .............................................
[CV] .............................. C=2, kernel=sigmoid, total= 27.2s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total= 53.3s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total= 52.7s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total= 52.0s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total= 54.8s
[CV] C=4, kernel=linear ..............................................
[CV] ............................... C=4, kernel=linear, total= 53.2s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total= 20.9s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total= 21.0s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total= 20.8s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total= 22.2s
[CV] C=4, kernel=rbf .................................................
[CV] .................................. C=4, kernel=rbf, total= 22.2s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total= 26.3s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total= 26.4s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total= 25.9s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total= 26.6s
[CV] C=4, kernel=sigmoid .............................................
[CV] .............................. C=4, kernel=sigmoid, total= 27.3s
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.3min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=linear ..............................................
[CV] ............................... C=5, kernel=linear, total= 1.1min
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total= 23.5s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total= 23.4s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total= 23.4s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total= 23.9s
[CV] C=5, kernel=rbf .................................................
[CV] .................................. C=5, kernel=rbf, total= 23.8s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total= 29.1s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total= 28.3s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total= 30.1s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total= 33.0s
[CV] C=5, kernel=sigmoid .............................................
[CV] .............................. C=5, kernel=sigmoid, total= 32.3s
[Parallel(n_jobs=1)]: Done 60 out of 60 | elapsed: 32.6min finished
# Show the parameters that give the highest accuracy
print(clf_svc.best_params_)
{'C': 5, 'kernel': 'rbf'}
# Show the best mean cross-validated accuracy
print(clf_svc.best_score_)
0.9828809523809524
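In the run above, scaling and PCA are fitted on all training rows before cross-validation, so each validation fold has already influenced the preprocessing. One alternative, sketched here as an assumption rather than as the original method, wraps the steps in a `Pipeline` so that they are refitted inside every fold. To keep it fast, the sketch uses scikit-learn's small built-in 8x8 digits dataset instead of the Kaggle files, and a reduced parameter grid.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small built-in dataset (1797 images of 8x8 pixels), not the Kaggle data
digits_X, digits_y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.78)),
    ("svc", SVC()),
])

# Step-prefixed names route each parameter to the right pipeline step
param_grid = {"svc__C": [1, 5], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(digits_X, digits_y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

After fitting, `search.predict` applies the same scaling and PCA to new data automatically, so the test set does not need to be transformed by hand.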
- Prediction
preds = clf_svc.predict(test)
image_id = pd.Series(range(1,len(preds)+1))
# Kaggle's sample submission names this column ImageId
result_2 = pd.DataFrame({'ImageId': image_id,'Label':preds})
# Save as a CSV file
result_2.to_csv('result_svc.csv',index = False)
print('Over')
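Before uploading, it can help to check the header and row count of the submission. This is a minimal sketch with made-up predictions, assuming the `ImageId,Label` header of Kaggle's sample submission; it writes to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd

preds = [2, 0, 9]  # made-up predictions standing in for clf_svc.predict(test)

submission = pd.DataFrame({
    "ImageId": range(1, len(preds) + 1),  # ids start at 1, not 0
    "Label": preds,
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
csv_text = buffer.getvalue()
print(csv_text)
```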