Filling in the gap left by the previous post

Article link: https://blog.csdn.net/Qwertyuiop2016/article/details/107164316

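The previous post (https://blog.csdn.net/Qwertyuiop2016/article/details/107120290) used a model of unknown origin, contour-classifier. It was provided directly by an expert on GitHub with no explanation of how it was built, so this post fills in that gap.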

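The model takes five features: box width, box height, box area, extent (box area / (box height * box width)), and box perimeter.
The target is the number of characters contained in the box. Since we already have the model, we can run it to produce the labels and so obtain the dataset we need.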
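To make the five features concrete, here is a minimal sketch that computes them for a contour given as a list of polygon vertices. The helper name box_features is hypothetical, and the sketch assumes that "area" means the area enclosed by the contour and that width and height come from its axis-aligned bounding box. The original features were presumably extracted with an image library, whose pixel-based conventions may differ slightly.

```python
import math

def box_features(points):
    """Return (width, height, area, extent, perimeter) for a closed polygon.

    `points` is a list of (x, y) vertices in order. Hypothetical helper that
    only illustrates the five features named in the text.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs)    # bounding-box width
    height = max(ys) - min(ys)   # bounding-box height

    area2 = 0.0       # twice the signed area (shoelace formula)
    perimeter = 0.0
    # Walk every edge, including the closing edge back to the first vertex.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(area2) / 2.0

    # Share of the bounding box covered by the contour (assumes a non-degenerate box).
    extent = area / (width * height)
    return width, height, area, extent, perimeter

rect = box_features([(0, 0), (4, 0), (4, 3), (0, 3)])
tri = box_features([(0, 0), (4, 0), (0, 3)])
print(rect)
print(tri)
```

A rectangle fills its bounding box completely, while the triangle fills only half of it, which is why extent is a useful shape cue next to the raw sizes.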

Extracted dataset: https://wwa.lanzous.com/i2IsUecwhza
[Image: screenshot of the dataset]
First, read the data:

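X = []  # feature rows (every column except the last)
y = []  # labels (last column: number of characters in the box)
with open('F:/result.csv', encoding='utf-8') as f:
    next(f, None)  # skip the header row
    for line in f:
        if not line.strip():
            continue  # ignore blank lines
        *features, label = line.strip().split(',')
        # Parse the features as floats instead of relying on implicit conversion.
        X.append([float(v) for v in features])
        y.append(label)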

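As the image above shows, the features cover very different ranges: extent is always below 1, while area is very large, so the data needs to be brought onto a common scale. That is the job of the contour-classifier-preprocessor file, as its name suggests. The file simply applies the transform method of sklearn.preprocessing.StandardScaler, which standardizes the data, i.e. rescales each column of the sample matrix to mean 0 and standard deviation 1. sklearn.preprocessing.scale also performs standardization; for the difference between the two, see: https://www.cnblogs.com/weiyunpeng/p/12249308.html.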

However, that notebook does not use anything specific to StandardScaler, so using scale instead makes little difference:

import sklearn.preprocessing as sp
X = sp.scale(X)
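As a rough illustration of what standardization computes, the following sketch does it by hand with the standard library. The helpers fit_standardizer and apply_standardizer are hypothetical names, not sklearn APIs. The sketch uses the population standard deviation, which is also what sklearn's scalers use. Splitting the work into "fit" and "apply" mirrors the idea behind StandardScaler: the learned statistics can be kept and applied to new data later, whereas scale only transforms the data it is given.

```python
from statistics import fmean, pstdev

def fit_standardizer(rows):
    """Learn the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Rescale every value to (value - column mean) / column std."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data: one large-valued column (like area) and one small one (like extent).
train = [[10.0, 0.2], [20.0, 0.4], [30.0, 0.9]]
means, stds = fit_standardizer(train)
scaled = apply_standardizer(train, means, stds)

# A new sample is scaled with the statistics learned from the training data.
new_scaled = apply_standardizer([[20.0, 0.5]], means, stds)
print(scaled)
print(new_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance computations inside the classifier.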

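Split the data into training and test sets at a ratio of 0.8:0.2. The random_state parameter fixes the seed of the random shuffle so that the split is reproducible; many people set it to 42, but the value itself is arbitrary.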

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
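The effect of a fixed seed can be sketched without sklearn. The function seeded_split below is a hypothetical stand-in, not the implementation of train_test_split; it only shows that shuffling with the same seed always produces the same partition.

```python
import math
import random

def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut off the test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_a, test_a = seeded_split(data, seed=42)
train_b, test_b = seeded_split(data, seed=42)

# Same seed, same split: reruns of the experiment see identical data.
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Any other integer works equally well as a seed; what matters is keeping it fixed when comparing results between runs.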

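We don't know which hyperparameter settings give the best classification results, so we use GridSearchCV to compare different combinations. As in the notebook, the model is an SVC classifier: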

from sklearn import svm
from sklearn.model_selection import GridSearchCV

clf = svm.SVC()
parameters = {'C':range(1, 11), 'kernel':['linear', 'poly', 'rbf', 'sigmoid'],'gamma': ['scale', 'auto']}
gs = GridSearchCV(clf, parameters, cv=5, n_jobs=-1, scoring='accuracy')
gs.fit(X_train, y_train)

print('Best parameters:', gs.best_params_)
print('Training set accuracy:', gs.best_estimator_.score(X_train, y_train))
print('Test set accuracy:', gs.best_estimator_.score(X_test, y_test))
print('Best cross-validation accuracy:', gs.best_score_)

Output:

Best parameters: {'C': 5, 'gamma': 'scale', 'kernel': 'rbf'}
Training set accuracy: 0.9944196428571429
Test set accuracy: 0.9881129271916791
Best cross-validation accuracy: 0.9914434523809523
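To get a feel for how much work the grid search does, the sketch below enumerates the same parameter grid with itertools.product. It illustrates the idea of an exhaustive grid only and is not GridSearchCV's actual code; the variable names candidates and total_fits are made up for the example.

```python
from itertools import product

parameters = {
    'C': range(1, 11),
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
}
cv = 5  # number of cross-validation folds, as in the text

# Every combination of one value per parameter is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

# Each candidate is trained once per fold.
total_fits = len(candidates) * cv
print(len(candidates), total_fits)
```

With 10 values of C, 4 kernels and 2 gamma settings there are 80 candidates and 400 fits, plus one final refit of the best candidate on the whole training set by default. Since gamma has no effect on the linear kernel, some of these candidates are effectively duplicates.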

The snippets above were split up to explain each step; here is the complete code in one piece:

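from sklearn import svm
import sklearn.preprocessing as sp
from sklearn.model_selection import train_test_split, GridSearchCV

X = []  # feature rows (every column except the last)
y = []  # labels (last column: number of characters in the box)
with open('F:/result.csv', encoding='utf-8') as f:
    next(f, None)  # skip the header row
    for line in f:
        if not line.strip():
            continue  # ignore blank lines
        *features, label = line.strip().split(',')
        # Parse the features as floats instead of relying on implicit conversion.
        X.append([float(v) for v in features])
        y.append(label)
X = sp.scale(X)  # standardize each column to mean 0 and standard deviation 1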

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
clf = svm.SVC()
parameters = {'C':range(1, 11), 'kernel':['linear', 'poly', 'rbf', 'sigmoid'],'gamma': ['scale', 'auto']}
gs = GridSearchCV(clf, parameters, cv=5, n_jobs=-1, scoring='accuracy')

gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_estimator_.score(X_train, y_train))
print(gs.best_estimator_.score(X_test, y_test))
print(gs.best_score_ )