Python and Machine Learning 1: A Minimal Framework for Using scikit-learn

I_am_Damon

Categories: python, machine learning. Tags: machine learning, python

Article link: https://blog.csdn.net/u012824853/article/details/60966980

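Machine learning has taken off, and suddenly many people are heading in this direction. I am a first-year master's student and this is not my field of study, but for various reasons I have joined in too.
I am a complete beginner teaching myself from scratch. After two months of stumbling along, studying a bit here and a bit there, I have settled on an initial learning path and started this blog. Let us encourage each other.
This blog series mainly draws on three books: "Python for Data Analysis", "Learning Data Mining with Python", and "Machine Learning" (Zhou Zhihua). The latter two form the main line of study.
The first book is a reference, used to fill in background on Python, pandas, and so on;
the second is the practice book, used mainly to practice applying and tuning algorithms with scikit-learn;
the third is the theory book, read alongside the second to deepen my understanding of the algorithms.
Of course, the fundamentals still require linear algebra, probability theory, and so on. I tried going through the math first, but theory without practice is forgotten the moment I turn around, so I will weave the math in while reading these three books.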

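To start, we use the simple K-nearest neighbors algorithm to get familiar with the most basic scikit-learn workflow. The explanations are all in the code comments.
To make things easier to follow, I have placed each import statement directly above the nearest code that uses it.
The minimal framework for using scikit-learn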

import numpy as np
import os

# Path to the Ionosphere dataset; the raw string keeps "\U" from being parsed as an escape sequence
data_filename = os.path.join(r"C:\Users\Han Chunhui", "Ionosphere", "ionosphere.data")
x = np.zeros((351, 34), dtype='float')  # storage for the data: 351 rows, 34 columns
y = np.zeros((351,), dtype='bool')  # storage for the labels: one per row

import csv
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]  # every column except the last is a feature
        x[i] = data
        y[i] = row[-1] == 'g'  # the last column is the label; x holds the data, y holds the labels

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=14)  # split into a training set and a test set
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()  # estimator object with default parameters
estimator.fit(x_train, y_train)  # fit() trains the estimator on the training set
y_predicted = estimator.predict(x_test)  # predict() returns the predicted labels for the test set
accuracy = np.mean(y_test == y_predicted) * 100  # share of predictions that match the true labels
print("The accuracy is {0:.1f}%".format(accuracy))  # the author's run printed "The accuracy is 86.4%"
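
As a quick sanity check on the steps above, here is a minimal sketch that runs without the Ionosphere file. The synthetic dataset from `make_classification` and the names `x_demo`, `y_demo`, and `knn` are assumptions made for illustration, not part of the original post. It shows that `train_test_split` holds out 25% of the rows by default, and that the hand-computed accuracy agrees with scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as the Ionosphere data (351 x 34).
x_demo, y_demo = make_classification(n_samples=351, n_features=34, random_state=14)

# With no test_size given, a quarter of the rows go to the test set.
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=14)
print(x_tr.shape, x_te.shape)

knn = KNeighborsClassifier().fit(x_tr, y_tr)
y_pred = knn.predict(x_te)

# The manual computation from the post and the library helper should agree.
manual_accuracy = np.mean(y_te == y_pred)
library_accuracy = accuracy_score(y_te, y_pred)
print(manual_accuracy, library_accuracy)
```

The accuracy on this synthetic data says nothing about the Ionosphere result; the point is only the shape of the split and the equivalence of the two accuracy computations.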

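Next we use cross-validation to test the algorithm's performance. Put simply, cross-validation repeatedly splits the same dataset into different training and test sets.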

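from sklearn.model_selection import cross_val_score  # formerly in sklearn.cross_validation
scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross validation: one accuracy score per fold
average_accuracy = np.mean(scores) * 100
# The author's run printed "The average accuracy is 82.3%". Older scikit-learn releases
# defaulted to 3 folds and newer ones to 5, so your number may differ.
print("The average accuracy is {0:.1f}%".format(average_accuracy))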
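
To make "splitting the same dataset several times" concrete, here is a small sketch using `KFold` on nine hypothetical sample indices. The array `indices` and the choice of three folds are assumptions for illustration. For classifiers, `cross_val_score` uses a stratified variant of this idea, but the principle is the same: each sample lands in the test set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(9)      # nine hypothetical samples, identified by index
kfold = KFold(n_splits=3)   # three folds, no shuffling

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(indices)):
    # Each iteration yields a different train/test partition of the same data.
    print("fold", fold, "train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx)

# Taken together, the test folds cover every sample exactly once.
all_tested = np.sort(np.concatenate(test_folds))
```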

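So far we have used the default parameters (that is, the default number of neighbors in the K-nearest neighbors algorithm, which is 5 in scikit-learn). Next we adjust the parameter by hand to see how different values perform.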

avg_scores = []  # mean cross-validation accuracy for each parameter value
all_scores = []  # per-fold scores for each parameter value
parameter_values = list(range(1, 21))  # try n_neighbors from 1 to 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)  # new estimator with this parameter value
    scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross validation
    avg_scores.append(np.mean(scores))  # store the mean score
    all_scores.append(scores)
from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')  # plot mean accuracy against n_neighbors
plt.show()

The plot produced by the code above shows how the accuracy changes as the parameter goes from 1 to 20.
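
The manual loop above can also be handed to scikit-learn's `GridSearchCV`, which runs the same kind of search and records the best setting. This is a swapped-in alternative, not the method used in the post. The synthetic data, the names `x_demo`, `y_demo`, and `search`, and the choice of `cv=3` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the sketch runs without the Ionosphere file.
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=14)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},  # same range as the loop: 1 to 20
    scoring="accuracy",
    cv=3,
)
search.fit(x_demo, y_demo)

best_k = search.best_params_["n_neighbors"]   # value with the highest mean accuracy
mean_scores = search.cv_results_["mean_test_score"]  # plays the role of avg_scores
print(best_k, search.best_score_)
```

Plotting `mean_scores` against the parameter values should give the same kind of curve as the manual loop.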

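Real datasets are usually messy and need a series of preprocessing steps, the most basic being normalization. Preprocessing steps like these usually form a fixed sequence. For convenience, and to avoid putting them in the wrong order, we can wrap the steps in a "pipeline", much as a function wraps (stands for) a series of operations. A slightly upgraded framework for using scikit-learn looks like this: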

X_broken = np.array(x)  # copy of the data that we deliberately distort
X_broken[:, ::2] /= 10  # divide every second feature (columns 0, 2, 4, ...) by 10
from sklearn.preprocessing import MinMaxScaler  # rescales each feature to the range [0, 1]
from sklearn.pipeline import Pipeline
scaling_pipeline = Pipeline([('scale', MinMaxScaler()), ('predict', KNeighborsClassifier())])  # pipeline: normalization, then classifier
scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')  # cross validation
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))  # the author's run printed 82.3%
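
To show what the pipeline wraps, here is a sketch comparing it with the same two steps done by hand: fit the scaler on the training rows, transform, then fit the classifier. The synthetic data and the names `x_demo`, `y_demo`, `scaler`, `knn`, and `pipe` are assumptions for illustration. It also checks the min-max formula, (value - min) / (max - min), against `MinMaxScaler`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data so the sketch runs without the Ionosphere file.
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=14)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=14)

# Manual version: learn min/max from the training rows only, then reuse them.
scaler = MinMaxScaler().fit(x_tr)
knn = KNeighborsClassifier().fit(scaler.transform(x_tr), y_tr)
manual_pred = knn.predict(scaler.transform(x_te))

# Pipeline version: the same two steps, applied in a fixed order.
pipe = Pipeline([("scale", MinMaxScaler()), ("predict", KNeighborsClassifier())])
pipe.fit(x_tr, y_tr)
pipe_pred = pipe.predict(x_te)

same_predictions = np.array_equal(manual_pred, pipe_pred)

# Min-max normalization written out per feature (column).
col_min, col_max = x_tr.min(axis=0), x_tr.max(axis=0)
manual_scaled = (x_tr - col_min) / (col_max - col_min)
print(same_predictions, np.allclose(manual_scaled, scaler.transform(x_tr)))
```

Inside `cross_val_score`, the pipeline refits the scaler on each training fold, so the test fold does not influence the scaling. That is one practical reason to prefer it over scaling the whole dataset up front.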