用Python和scikit-learn来介绍机器学习
数据加载
当我们学习机器学习的时候,首先必须得有数据,我们得把数据加载到内存中才能对它进行处理。这一节我们先介绍如何加载数据的问题。我们以从著名的UCI Machine Learning Repository.加载.csv格式的文件来说明。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> numpy <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">as</span> np <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># the python version is 3.5</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> urllib.request <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># url with dataset</span> url=<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"</span> <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># download the file</span> raw_data = urllib..request.urlopen(url) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># load the CSV file as a numpy matrix</span> dataset = np.loadtxt(raw_data,delimiter=<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">","</span>) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># separate the data from the target attributes</span> X = dataset[:,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>:<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>] Y = dataset[:,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>]</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul>
说明:下面的讨论我们将使用这里加载的数据集,这里 X
为特征矩阵,Y
为类别标签向量。
数据正则化
几乎所有的机器学习算法基于梯度方法,而大部分梯度方法对数据的尺度高度敏感,因此在我们运行机器学习算法之前,我们应该进行正则化(normalization),或者所谓的标准化(standardization)
- 正则化:每一个特征的范围变为从
0
到1
。 - 标准化:每一个特征有一个平均值
0
和方差1
。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> preprocessing <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># normalize the data attributes</span> normalized_X = preprocessing.normalize(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># standardize the data attributes</span> standardized_X = preprocessing.scale(X)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li></ul>
特征选择
我们在处理机器学习任务时最重要的能力是恰当的选择或者甚至创造特征,这被称为特征选择或者特征工程。尽管特征工程是一个非常的创造性过程,并且更多的依赖直觉和领域知识,但是有许多已有的特征选择方法。树方法(Tree algorithms)允许计算特征的信息量。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.ensemble <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> ExtraTreesClassifier model = ExtraTreesClassifier() model.fit(X, y) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># display the relative importance of each attribute</span> print(model.feature_importances_)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul>
所有其他的方法都是基于有效的搜索特征子集去找到最好的特征子集以使开发的模型提供最好的效果。其中的一种搜索算法是递归特征消减算法(Recursive Feature Elimination)
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> RFE <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.linear_model <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> LogisticRegression model = LogisticRegression() <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># create the RFE model and select 3 attributes</span> rfe = RFE(model, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>) rfe = rfe.fit(X, y) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the selection of the attributes</span> print(rfe.support_) print(rfe.ranking_)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li></ul>
算法开发
LR
这个算法的好处是输出是每一个对象属于一个类的概率
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.linear_model <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> LogisticRegression model = LogisticRegression() model.fit(X, y) print(model) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># make predictions</span> expected = y predicted = model.predict(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the fit of the model</span> print(metrics.classification_report(expected, predicted)) print(metrics.confusion_matrix(expected, predicted))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li></ul>
朴素贝叶斯
这个算法的主要任务是恢复训练样本的数据分布密度,这个方法在多类分类问题中经常提供好的效果
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.naive_bayes <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> GaussianNB model = GaussianNB() model.fit(X, y) print(model) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># make predictions</span> expected = y predicted = model.predict(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the fit of the model</span> print(metrics.classification_report(expected, predicted)) print(metrics.confusion_matrix(expected, predicted))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li></ul>
K最近邻
当参数(大部数为度量)被很好的设置时,这个算法对回归问题
能提供好的效果。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.neighbors <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> KNeighborsClassifier <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># fit a k-nearest neighbor model to the data</span> model = KNeighborsClassifier() model.fit(X, y) print(model) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># make predictions</span> expected = y predicted = model.predict(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the fit of the model</span> print(metrics.classification_report(expected, predicted)) print(metrics.confusion_matrix(expected, predicted))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul>
决策树
Classification and Regression Trees (CART) 经常被用在对象拥有分类特征和回归、分类问题,非常适合多分类问题。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.tree <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> DecisionTreeClassifier <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># fit a CART model to the data</span> model = DecisionTreeClassifier() model.fit(X, y) print(model) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># make predictions</span> expected = y predicted = model.predict(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the fit of the model</span> print(metrics.classification_report(expected, predicted)) print(metrics.confusion_matrix(expected, predicted))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul>
支持向量机
是最流行的分类问题机器学习算法之一,通过“一对多”的方法可以和LR一样用来多分类。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> metrics <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> SVC <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># fit a SVM model to the data</span> model = SVC() model.fit(X, y) print(model) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># make predictions</span> expected = y predicted = model.predict(X) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the fit of the model</span> print(metrics.classification_report(expected, predicted)) print(metrics.confusion_matrix(expected, predicted))</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li></ul>
怎样优化算法参数
在创造实际有效的算法中最困难的一个阶段是选择正确的参数,对于有足够的经验来说,这是很容易的。但是不管怎样,我们不得不做搜索。幸运的是,Scikit-learn提供了许多有效的函数来做这件事。
作为一个例子,我们来看一看正则化参数的选取。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> numpy <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">as</span> np <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.linear_model <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> Ridge <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.grid_search <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> GridSearchCV <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># prepare a range of alpha values to test</span> alphas = np.array([<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.1</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.01</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.001</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.0001</span>,<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>]) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># create and fit a ridge regression model, testing each alpha</span> model = Ridge() grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas)) grid.fit(X, y) print(grid) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the results of the grid search</span> print(grid.best_score_) print(grid.best_estimator_.alpha)</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li></ul><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; background-color: rgb(238, 238, 238); top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right: 1px solid rgb(221, 221, 221); list-style: none; text-align: right;"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li></ul>
Sometimes it is more efficient to randomly select a parameter from the given range, estimate the algorithm quality for this parameter and choose the best one.
有时随机地从一个给定的范围选取一个参数是更有效的,对于这个参数评估这个算法并选择最好的一个。
<code class="language-python hljs has-numbering" style="display: block; padding: 0px; background: transparent; color: inherit; box-sizing: border-box; font-family: "Source Code Pro", monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal;"><span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> numpy <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">as</span> np <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> scipy.stats <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> uniform <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">as</span> sp_rand <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.linear_model <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> Ridge <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">from</span> sklearn.grid_search <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">import</span> RandomizedSearchCV <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># prepare a uniform distribution to sample for the alpha parameter</span> param_grid = {<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'alpha'</span>: sp_rand()} <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># create and fit a ridge regression model, testing random alpha values</span> model = Ridge() rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">100</span>) rsearch.fit(X, y) print(rsearch) <span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># summarize the results of the random parameter search</span> print(rsearch.best_score_) print(rsearch.best_estimator_.alpha)</code>