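1. Cross Validation
The rough idea of cross validation is this: the original data is split into two parts, train_data and test_data. train_data is used for training and test_data is used to measure accuracy. The result measured on test_data is called the validation_error. When we apply an algorithm to a data set, we cannot split it randomly into train_data and test_data just once and take the single resulting validation_error as the measure of how good the algorithm is, because a single split is subject to chance. Instead we split the data into train_data and test_data many times and compute a validation_error on each split. That gives a set of validation_error values, and this set is a much more reliable measure of the algorithm.
Cross validation is a very good way to evaluate performance when the amount of data is limited. There are many ways to split the original data into train data and test data, which is why there are many variants of cross validation.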
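The idea of repeating the random split can be sketched in a few lines. This is only an illustration under my own assumptions: it uses `ShuffleSplit` from `sklearn.model_selection`, the iris data, ten splits and a 70/30 ratio, none of which come from the text above, and it defines the validation_error as 1 minus the accuracy.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import ShuffleSplit

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Ten independent random 70/30 splits (assumed values); random_state makes the run repeatable.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

validation_errors = []
for train_idx, test_idx in splitter.split(X):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])             # train on train_data only
    accuracy = clf.score(X[test_idx], y[test_idx])  # evaluate on test_data
    validation_errors.append(1.0 - accuracy)        # error = 1 - accuracy

validation_errors = np.array(validation_errors)
print("mean validation_error: %.3f" % validation_errors.mean())
print("std of validation_error: %.3f" % validation_errors.std())
```

Looking at the mean together with the spread is what makes the estimate less dependent on one lucky or unlucky split.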
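The most important function in sklearn's cross validation module is the following:
sklearn.cross_validation.cross_val_score, which is called as scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None). (The sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20; the same function is now available as sklearn.model_selection.cross_val_score, and in later releases the score_func argument has been replaced by scoring.)
Parameters:
clf: the classifier; any classifier can be used, for example a support vector machine: clf = svm.SVC(kernel='linear', C=1);
raw_data: the original data;
raw_target: the original class labels;
cv: selects the cross validation strategy. To quote scikit-learn (When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.): if cv is an int, KFold or StratifiedKFold is used by default, and StratifiedKFold is chosen when the estimator is a classifier, that is, when it derives from ClassifierMixin.
cross_val_score: the return value is the score obtained on test_data for each split of raw_data. The metric can be chosen with the score_func parameter; if it is not given, the default score method of clf is used, which is accuracy for classifiers.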
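To make the cv parameter more concrete, here is a small sketch comparing an integer cv with explicitly constructed splitters. It assumes a scikit-learn version that has `sklearn.model_selection`; the choice of iris, five folds and `random_state=0` is mine, for illustration only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# An integer cv with a classifier: stratified folds are chosen automatically.
scores_int = cross_val_score(clf, iris.data, iris.target, cv=5)

# The same strategy spelled out explicitly.
scores_stratified = cross_val_score(
    clf, iris.data, iris.target, cv=StratifiedKFold(n_splits=5))

# Plain KFold ignores the labels; shuffle because iris is sorted by class.
scores_kfold = cross_val_score(
    clf, iris.data, iris.target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0))

print(scores_int)
print(scores_stratified)
print(scores_kfold)
```

Passing a splitter object instead of an integer is the way to control shuffling, the number of folds, or the random seed.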
scikit-learn cross-validation code:
```python
>>> # sklearn.cross_validation was removed in scikit-learn 0.20;
>>> # sklearn.model_selection provides the same functions.
>>> from sklearn import datasets, svm
>>> from sklearn.model_selection import cross_val_score
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # 5-fold cv
>>> # change the metric: the old score_func argument is now `scoring`
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_weighted')
>>> # f1 score: http://en.wikipedia.org/wiki/F1_score
```
Note: if using LR, clf = LogisticRegression().
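As a sketch of the note above, the same call with logistic regression might look like this. The `max_iter=1000` setting is my own assumption, added only so the solver converges on iris without a warning; the import path assumes a scikit-learn version with `sklearn.model_selection`.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# max_iter is an assumed value, raised to avoid a convergence warning.
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Nothing else changes: cross_val_score only needs an estimator with fit and a way to score it.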
Generate a data set for cross validation:
```python
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
```
Split the data into a training set and a test set:
```python
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
```
Using cross validation
Below, the training set and the test set are split manually. Enter the following code in the console to try it:
```python
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn import svm
>>> from sklearn.model_selection import train_test_split  # formerly in sklearn.cross_validation
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...
```
Here is an example of cross validation:
```python
>>> from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(
...     clf, iris.data, iris.target, cv=5)
...
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
```
Calling cross_val_score with cv=5 performs 5-fold cross validation and returns scores, an array that holds the accuracy obtained on each fold.
```python
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
```
scores.mean() gives the average accuracy. A different metric can be selected with the scoring parameter:
```python
>>> from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
>>> scores = cross_val_score(clf, iris.data, iris.target,
...     cv=5, scoring='f1_weighted')
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
```
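When more than one metric is of interest, a possible variation is `cross_validate`, which accepts a list of scoring names. This is a sketch that goes slightly beyond the text above: `cross_validate` is not used in the original, and it assumes a scikit-learn version (0.19 or later) that provides it in `sklearn.model_selection`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# One cross validation run, two metrics evaluated on every fold.
results = cross_validate(clf, iris.data, iris.target, cv=5,
                         scoring=['accuracy', 'f1_weighted'])

# Per-fold scores are stored under "test_<metric name>".
for name in ('test_accuracy', 'test_f1_weighted'):
    print(name, results[name].mean())
```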
Importing data in libsvm format:
```python
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...
>>> X_train.todense()  # convert the sparse matrix to a full (dense) feature matrix
```
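To see what a libsvm-format file looks like, one option is to write a tiny data set with `dump_svmlight_file` and read it back. The data, the temporary directory and the file name `toy_dataset.txt` are all made up for this sketch.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny made-up data set: 3 samples, 4 features, mostly zeros.
X = np.array([[0.0, 1.5, 0.0, 2.0],
              [3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 4.5, 0.0]])
y = np.array([1, 0, 1])

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.txt")
dump_svmlight_file(X, y, path, zero_based=True)

with open(path) as f:
    print(f.read())  # one "label index:value ..." line per sample

# n_features is given explicitly so trailing all-zero columns are not lost.
X_loaded, y_loaded = load_svmlight_file(path, n_features=4, zero_based=True)
print(type(X_loaded))      # a SciPy sparse matrix
print(X_loaded.toarray())  # dense copy for inspection
```

Only the non-zero entries are stored in the file, which is why the loader returns a sparse matrix.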
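2. Handling class imbalance
For imbalanced problems, where the ratio of positive to negative samples is heavily skewed, one way to tune a classifier is to modify its training data. There are two options: undersampling and oversampling. Oversampling means duplicating examples, while undersampling means deleting examples. Oversampling may eventually lead to overfitting; with undersampling, the deleted samples may contain important information, which can lead to underfitting. When positive examples are scarce, the usual approach is to combine undersampling of the negative class with oversampling of the positive class.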
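The mixed strategy can be sketched with plain NumPy. Everything concrete here is an assumption for illustration: the toy data (10 positives, 90 negatives), the target of 30 examples per class, and the use of simple random sampling rather than any more elaborate method.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy imbalanced data (made up for illustration): 10 positives, 90 negatives.
X = rng.randn(100, 2)
y = np.array([1] * 10 + [0] * 90)

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
target = 30  # hypothetical per-class size after resampling

# Oversample the positive class: draw with replacement, so examples are duplicated.
pos_sampled = rng.choice(pos_idx, size=target, replace=True)
# Undersample the negative class: draw without replacement, so examples are dropped.
neg_sampled = rng.choice(neg_idx, size=target, replace=False)

keep = np.concatenate([pos_sampled, neg_sampled])
rng.shuffle(keep)
X_balanced, y_balanced = X[keep], y[keep]

print(np.bincount(y), "->", np.bincount(y_balanced))
```

In practice the resampling should be applied to the training portion only, so that the test data keeps the original class ratio.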
3. Learning SVM with scikit-learn
<code class="hljs r has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">>>> from sklearn import datasets >>> iris = datasets.load_iris() >>> digits = datasets.load_digits() >>> print digits.data [[ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>] [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">10.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>] [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span 
class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">16.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>] <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>] [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">12.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>] [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" 
style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">10.</span> <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">12.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1.</span> <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>]] >>> digits.target array([<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2</span>, <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">...</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>]) >>> digits.images[<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>] array([[ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span 
class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">15.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">10.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">15.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">15.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">11.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">12.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span 
class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">9.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">4.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">11.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">12.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">7.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span 
class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">2.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">14.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">5.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">10.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">12.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>], [ <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">6.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">13.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">10.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>, <span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.</span>]])
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.001</span>, C=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">100.</span>)
>>> clf.fit(digits.data[:-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>], digits.target[:-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>])
SVC(C=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">100.0</span>, cache_size=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">200</span>, class_weight=None, coef0=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.0</span>, degree=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">3</span>, gamma=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.001</span>, kernel=<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">'rbf'</span>, max_iter=-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>, probability=False, random_state=None, shrinking=True, tol=<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0.001</span>, verbose=False)
>>> clf.predict(digits.data[-<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">1</span>:])
array([<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">8</span>])</code>
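The session above trains an SVC on every digit image except the last one and then predicts the class of that held-out image. A minimal standalone sketch of the same steps follows, assuming a recent scikit-learn release in which `predict` expects a 2-D array; this is why the held-out sample is taken with the slice `digits.data[-1:]` rather than the index `digits.data[-1]`. The variable name `prediction` is introduced here for illustration only.

```python
from sklearn import datasets, svm

# Load the bundled handwritten-digits dataset (8x8 images flattened to 64 features).
digits = datasets.load_digits()

# Same hyperparameters as in the session above.
clf = svm.SVC(gamma=0.001, C=100.0)

# Train on every sample except the last one.
clf.fit(digits.data[:-1], digits.target[:-1])

# Slicing with [-1:] keeps the 2-D shape (1, 64) that predict() expects.
prediction = clf.predict(digits.data[-1:])
print(prediction)
```

Holding out a single sample only demonstrates the fit/predict workflow; it says nothing reliable about accuracy, which is what the cross-validation approach from section 1 is for.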
3. Learning RandomForest with scikit-learn
Usage example
<code class="hljs lua has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)</code>
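The usage example stops after `fit`. As a rough sketch of what typically comes next, the fitted forest can be queried with `predict` and `predict_proba`. The `random_state=0` argument and the names `pred` and `proba` are assumptions added here for reproducibility and illustration; they are not part of the original example.

```python
from sklearn.ensemble import RandomForestClassifier

# The same toy data as above: two samples, one per class.
X = [[0, 0], [1, 1]]
Y = [0, 1]

# random_state is fixed only so that repeated runs give the same forest (an assumption).
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf = clf.fit(X, Y)

# Hard class labels for each input row.
pred = clf.predict([[0, 0], [1, 1]])

# Per-class probabilities, averaged over the trees in the forest.
proba = clf.predict_proba([[0, 0], [1, 1]])

print(pred)
print(proba)
print(len(clf.estimators_))  # number of fitted trees
```

With only two training samples, individual bootstrap trees may see a single class, so the probabilities need not be exactly 0 or 1; a realistic dataset would be evaluated with cross-validation as described in section 1.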
Method
Default parameter values in the RandomForestClassifier constructor
<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">def __init__(self,
             n_estimators=10,
             criterion="gini",
             max_depth=None,
             min_samples_split=2,
             min_samples_leaf=1,
             min_weight_fraction_leaf=0.,
             max_features="auto",
             max_leaf_nodes=None,
             bootstrap=True,
             oob_score=False,
             n_jobs=1,
             random_state=None,
             verbose=0,
             warm_start=False,
             class_weight=None):</code>
<p>http://www.360doc.com/content/16/0626/16/20558639_570898095.shtml</p>
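The signature listed above reflects the scikit-learn release the post was written against, and some defaults (for example `n_estimators` and `max_features`) have changed in later releases. A small sketch for checking the defaults of whichever version is installed, using the standard `get_params` method; the list of names in `shown` is an arbitrary selection made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# An estimator built with no arguments carries the defaults of the installed version.
params = RandomForestClassifier().get_params()

# A hand-picked subset of the constructor arguments listed above.
shown = ["n_estimators", "criterion", "max_depth", "min_samples_split",
         "min_samples_leaf", "max_features", "bootstrap", "oob_score"]

for name in shown:
    print(f"{name} = {params[name]!r}")
```

Comparing this output with the signature above is a quick way to see which defaults differ between the release used in the post and the one installed locally.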