In this article, we will implement several machine learning algorithms in Python using Scikit-learn, one of the most popular machine learning libraries for Python. We will use a simple dataset to train classifiers that distinguish between different types of fruit.
The purpose of this article is to identify the machine learning algorithm best suited to the problem at hand; to do so, we will compare several algorithms and select the one that performs best.
Data
The fruit dataset was created by Dr. Iain Murray of the University of Edinburgh. He bought a few dozen oranges, lemons and apples of different varieties and recorded measurements of each fruit.
Let's take a look at the first few rows of the data.
```python
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

fruits = pd.read_table('fruit_data_with_colors.txt')
fruits.head()
```
![1da126f7a5dd4084463c6379350909a7.png](https://i-blog.csdnimg.cn/blog_migrate/b2b2638ed4a818292be1c504bfc9adaa.jpeg)
![441c47799a1801ab19811102e86f4ebb.png](https://i-blog.csdnimg.cn/blog_migrate/cc6206f8a5683d1de6d8d6b3aad9d356.jpeg)
Each row of the dataset represents one piece of fruit, described by the feature columns of the table.
There are 59 fruits and 7 features in the dataset:
```python
print(fruits.shape)
```
(59, 7)
There are four types of fruit in the dataset:
```python
print(fruits['fruit_name'].unique())
```
['apple' 'mandarin' 'orange' 'lemon']
The data is fairly balanced except for mandarins; we will just have to work with that.
```python
print(fruits.groupby('fruit_name').size())
```
![bb7691b53a86e1d67ec0a53935471d04.png](https://i-blog.csdnimg.cn/blog_migrate/0e431211dd2375f1f0cc515d19b1bce7.jpeg)
```python
import seaborn as sns

sns.countplot(fruits['fruit_name'], label="Count")
plt.show()
```
![31c51245194b90047b67bf5be9f9bda5.png](https://i-blog.csdnimg.cn/blog_migrate/6f03930800afb0898182b95b152f5cbe.jpeg)
![a162406dae14c8d00679d2347231d36b.png](https://i-blog.csdnimg.cn/blog_migrate/3ce45d4f6c2fde2a2395ba8f030e80f9.jpeg)
Visualization
- A box plot for each numeric variable will give us a clearer idea of the distribution of the input variables:
```python
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2),
                                        sharex=False, sharey=False, figsize=(9,9))
plt.savefig('fruits_box')
plt.show()
```
![83a802284486d61eaf9a7fe370eaab2e.png](https://i-blog.csdnimg.cn/blog_migrate/9b21fadbd5a8bb03006698168dad5643.jpeg)
![d5085854a490a3d775dee8148eab3d25.png](https://i-blog.csdnimg.cn/blog_migrate/bf28ddf9933055fce13243c5788af7d7.jpeg)
- The color score approximately follows a Gaussian distribution.
```python
import pylab as pl

fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9,9))
pl.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()
```
![0b31db59691fa57fca48ce1eef7e775b.png](https://i-blog.csdnimg.cn/blog_migrate/bf864e14c306b3b587689e2923d4210c.jpeg)
![c5b50b0348916e3123012faf42773977.png](https://i-blog.csdnimg.cn/blog_migrate/4d5c18e2dc77dad289f9d0d44c0f36fb.jpeg)
- Some pairs of attributes are correlated (e.g. mass and width), which suggests a strong, predictable relationship between them.
```python
from pandas.plotting import scatter_matrix
from matplotlib import cm

feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

cmap = cm.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40,
                         hist_kwds={'bins': 15}, figsize=(9,9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
```
![a9ceac828fe7e0073d69d0808f5fa854.png](https://i-blog.csdnimg.cn/blog_migrate/ee9348dfee15fa90a114fae69dc9c93c.jpeg)
![a4183d9815889a6f49ef83bfe8817dde.png](https://i-blog.csdnimg.cn/blog_migrate/8fadbc466b115b69dbedc0ffeb4e4df8.jpeg)
Statistical Summary
![fe321856247d742061be2f98f0339db3.png](https://i-blog.csdnimg.cn/blog_migrate/251c9dcdeb75c7429f9d3fb093b978fe.jpeg)
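A statistical summary like the one shown above can be produced with pandas' describe(); a minimal sketch, assuming the fruits DataFrame loaded earlier:

```python
# Count, mean, standard deviation, min, quartiles and max for each numeric feature.
print(fruits[['mass', 'width', 'height', 'color_score']].describe())
```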
We can see that the numerical values are not on the same scale, so we will need to apply the same scaling to the test set that we compute on the training set.
Create training and test sets, and apply scaling
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
![cabb181bf8caa9942dbf1475572e4097.png](https://i-blog.csdnimg.cn/blog_migrate/25fdf5de2570dea53913501823920a93.jpeg)
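MinMaxScaler rescales each feature to the [0, 1] range using the minimum and maximum observed in the training data; fitting the scaler on X_train only and reusing those statistics on X_test avoids leaking information from the test set. A minimal sketch of the equivalent arithmetic for a single, hypothetical feature column:

```python
import numpy as np

# Hypothetical 'mass' values: the first four act as training data, the last as test data.
train_mass = np.array([86.0, 84.0, 176.0, 178.0])
test_mass = np.array([160.0])

# The min and max come from the training data only.
lo, hi = train_mass.min(), train_mass.max()

train_scaled = (train_mass - lo) / (hi - lo)  # guaranteed to lie in [0, 1]
test_scaled = (test_mass - lo) / (hi - lo)    # may fall outside [0, 1] for unseen extremes
print(train_scaled, test_scaled)
```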
Build Models
Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
```
![00c5feeb3581ef3e7fb07bad6da7129b.png](https://i-blog.csdnimg.cn/blog_migrate/3ef527dba2fb715b2a79a4c2e7260705.jpeg)
Accuracy of Logistic regression classifier on training set: 0.70
Accuracy of Logistic regression classifier on test set: 0.40
Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
```
![18dd31968f2d73cf279d55771527d802.png](https://i-blog.csdnimg.cn/blog_migrate/ead90fdcf31180e19519b4a8410fa48c.jpeg)
Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.73
K-Nearest Neighbors
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
      .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
      .format(knn.score(X_test, y_test)))
```
![c7e958d2ba774883833b324b85da7442.png](https://i-blog.csdnimg.cn/blog_migrate/06007d33553d437b3275dd967a3dd881.jpeg)
Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00
Linear Discriminant Analysis
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
      .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
      .format(lda.score(X_test, y_test)))
```
![72d6dc0a32a264e7542409d1cac3b38e.png](https://i-blog.csdnimg.cn/blog_migrate/20004895cbe914f3f85ef7f1b39914a4.jpeg)
Accuracy of LDA classifier on training set: 0.86
Accuracy of LDA classifier on test set: 0.67
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
      .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
      .format(gnb.score(X_test, y_test)))
```
![e3d85ec8389ef35f18406061dfb5b871.png](https://i-blog.csdnimg.cn/blog_migrate/a6868738ee527ed6f5fe24a0486648ed.jpeg)
Accuracy of GNB classifier on training set: 0.86
Accuracy of GNB classifier on test set: 0.67
Support Vector Machine
```python
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
      .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
      .format(svm.score(X_test, y_test)))
```
![2ad3ce58037786a3ec1886f953ddfde9.png](https://i-blog.csdnimg.cn/blog_migrate/78622386af0e4a78e46f3440ef0d6560.jpeg)
Accuracy of SVM classifier on training set: 0.61
Accuracy of SVM classifier on test set: 0.33
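To make the comparison explicit, the six classifiers fitted above can be scored side by side on the same split; a minimal sketch, assuming the fitted objects logreg, clf, knn, lda, gnb and svm from the previous sections are still in scope:

```python
models = [('Logistic Regression', logreg),
          ('Decision Tree', clf),
          ('K-NN', knn),
          ('LDA', lda),
          ('Gaussian NB', gnb),
          ('SVM', svm)]

# Train and test accuracy for each already-fitted model.
for name, model in models:
    print('{:<20s} train: {:.2f}  test: {:.2f}'.format(
        name, model.score(X_train, y_train), model.score(X_test, y_test)))
```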
KNN was the most accurate model we tried. The confusion matrix shows that no errors were made on the test set; however, the test set is very small.
```python
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```
![5e4afa60299464efd4ad79091fc847e6.png](https://i-blog.csdnimg.cn/blog_migrate/e7183699008be01360830413b0e0f06d.jpeg)
![2e5c94ba5ee554a4559837ce5758255e.png](https://i-blog.csdnimg.cn/blog_migrate/8b205f32ae0891b02de4ccbb8974b597.jpeg)
Plotting the decision boundary of the k-NN classifier
```python
import numpy as np
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn import neighbors

X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot_fruit_knn(X, y, n_neighbors, weights):
    X_mat = X[['height', 'width']].to_numpy()
    y_mat = y.to_numpy()

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF', '#AFAFAF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#AFAFAF'])

    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X_mat, y_mat)

    # Plot the decision boundary by assigning a color in the color map
    # to each mesh point.
    mesh_step_size = .01  # step size in the mesh
    plot_symbol_size = 50

    x_min, x_max = X_mat[:, 0].min() - 1, X_mat[:, 0].max() + 1
    y_min, y_max = X_mat[:, 1].min() - 1, X_mat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
                         np.arange(y_min, y_max, mesh_step_size))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot training points
    plt.scatter(X_mat[:, 0], X_mat[:, 1], s=plot_symbol_size, c=y,
                cmap=cmap_bold, edgecolor='black')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    patch0 = mpatches.Patch(color='#FF0000', label='apple')
    patch1 = mpatches.Patch(color='#00FF00', label='mandarin')
    patch2 = mpatches.Patch(color='#0000FF', label='orange')
    patch3 = mpatches.Patch(color='#AFAFAF', label='lemon')
    plt.legend(handles=[patch0, patch1, patch2, patch3])

    plt.xlabel('height (cm)')
    plt.ylabel('width (cm)')
    plt.title("4-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.show()

plot_fruit_knn(X_train, y_train, 5, 'uniform')
```
![36546fc79475a17df6b576595edf1c92.png](https://i-blog.csdnimg.cn/blog_migrate/f93930ee6508fb42e4209bdb10dd9d7f.jpeg)
![68a09dd8f112506aa41c01bf70067d7f.png](https://i-blog.csdnimg.cn/blog_migrate/2eba887069fc2e73dfd9d89cc8e19eb3.jpeg)
```python
k_range = range(1, 20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 20])
```
![8f990bc122b71da68529e48480d00a35.png](https://i-blog.csdnimg.cn/blog_migrate/952b639cd7fde2302769e64cc7db944a.jpeg)
![8b81c77194c2fdd7f340d1023e3e4991.png](https://i-blog.csdnimg.cn/blog_migrate/7b555a38de81ec3de8a472a191b76236.jpeg)
For this particular dataset, we obtain the highest accuracy when k = 5.
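With only 59 samples, a single train/test split leaves roughly 15 fruits for testing, so the curve above can be noisy; one way to double-check the choice of k is cross-validation, re-fitting the scaler inside each fold. A minimal sketch, assuming the unscaled X and y defined earlier:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

for k in range(1, 11):
    # The pipeline scales within each fold, so no test-fold information leaks into training.
    pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=5)
    print('k = {:2d}  mean CV accuracy: {:.2f}'.format(k, scores.mean()))
```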
Summary
In this article, we focused on prediction accuracy. Our objective was to learn a model with good generalization performance, that is, one that maximizes prediction accuracy. We set out to identify the machine learning algorithm best suited to the problem at hand (classifying types of fruit); to that end, we compared several algorithms and selected the best-performing one.