先把来源写上
来源:贪心学院,https://www.zhihu.com/people/tan-xin-xue-yuan/activities
以前文章:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
模拟数据集
from sklearn.datasets.samples_generator import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);
假想每一条分割线是有宽度的
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
yfit = m * xfit + b
plt.plot(xfit, yfit, '-k')
plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
在SVM的框架下, 认为最宽的线为最优的分割线
人脸识别
所以pass到了
作业 使用SVM检测蘑菇?是否有毒
使用的数据集: https://archive.ics.uci.edu/ml/datasets/mushroom
数据集中每一条数据包含如下特征, 特征包括对蘑菇形状, 质地, 色彩等特征的描述, 我们需要以此判断?是否有毒§或者可以吃(e). 为两个类别的分类问题.
22个
建立寻pipeline
# 将特征和类别标签分布赋值给 X 和 y
X_mush = mush_df_encoded.iloc[:,2:]
y_mush = mush_df_encoded.iloc[:,1]
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
pca = PCA(n_components=14, whiten=True, random_state=42)
svc = SVC(kernel='linear', class_weight='balanced')
model = make_pipeline(pca, svc)
将数据分为训练和测试数据¶
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_mush, y_mush,
random_state=41)
调参:通过交叉验证寻找最佳的 C (控制间隔的大小)
from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [1, 5, 10, 50]}
grid = GridSearchCV(model, param_grid)
%time grid.fit(Xtrain, ytrain)
print(grid.best_params_)
Wall time: 17.3 s
{'svc__C': 50}
使用训练好的SVM做预测¶
model = grid.best_estimator_
yfit = model.predict(Xtest)
生成性能报告
比较下老师的
竟然又一个数高过它