When first learning data-mining algorithms, it is often unclear which one to choose for a given job. This post compares several from the angle of running time. The baseline algorithms tested are DecisionTreeClassifier, LogisticRegression, LinearSVC, and SVC (RBF kernel). SVC(kernel = 'linear') was not included as a separate comparison; LinearSVC covers that case.
The dataset is make_moons from sklearn.datasets; the figure below shows the data with 10% noise added (the benchmark code itself uses noise = 0.2). In terms of classification quality, logistic regression and LinearSVC will certainly not do well on a non-linear problem like this, but this test is about running time, so classification quality is not discussed here.
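For readers unfamiliar with make_moons, here is a minimal sketch of generating the dataset used in this benchmark (noise = 0.2, matching the benchmark code; the sample count here is arbitrary):

```python
from sklearn.datasets import make_moons

# Two interleaving half-moon clusters; every sample has exactly
# 2 features and a binary label, just as in the benchmark below.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

print(X.shape)   # (1000, 2)
print(set(y))    # {0, 1}
```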
First, the code, to show the procedure:
import numpy as np
import pandas as pd
import time
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

# 100 sample sizes from 10k to 1M in steps of 10k; make_moons always
# produces 2 features per sample.
n_samples = np.linspace(10000, 1000000, 100)
#n_features = np.linspace(10, 100, 10)
k = 0
list_DecisionTreeClassifier = []
list_LogisticRegression = []
list_LinearSVC = []
list_SVC = []

for n_sample in n_samples:
    # for n_feature in n_features:
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    x, y = make_moons(n_samples=int(n_sample), noise=0.2, random_state=42)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
    tree_clf = DecisionTreeClassifier()
    log_clf = LogisticRegression()
    linearsvc_clf = LinearSVC()
    svc_clf = SVC()  # RBF kernel by default
    for clf in [tree_clf, log_clf, linearsvc_clf, svc_clf]:
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        k = k + 1
        # Note: every time below is measured from `start`, so each
        # classifier's recorded time also includes the fit/predict time
        # of the classifiers that ran before it in the same iteration.
        print(clf.__class__.__name__, ' n_sample =', int(n_sample),
              '\ntrain_score: ', clf.score(x_train, y_train), 'test_score: ', clf.score(x_test, y_test),
              '\naccuracy_score: ', accuracy_score(y_test, y_pred),
              '\nrecall_score: ', recall_score(y_test, y_pred, average='micro'),
              '\ntime: ', time.perf_counter() - start,
              '\nk =', k,
              '\n---------------------------------------------------------------')
        if clf is tree_clf:
            list_DecisionTreeClassifier.append(time.perf_counter() - start)
        elif clf is log_clf:
            list_LogisticRegression.append(time.perf_counter() - start)
        elif clf is linearsvc_clf:
            list_LinearSVC.append(time.perf_counter() - start)
        elif clf is svc_clf:
            list_SVC.append(time.perf_counter() - start)

print(list_DecisionTreeClassifier)
print(list_LogisticRegression)
print(list_LinearSVC)
print(list_SVC)
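pandas is imported above but never used; one way to put it to work is to collect the four timing lists into a single DataFrame for inspection or CSV export. A sketch, with illustrative values standing in for the list_* variables filled by the loop:

```python
import numpy as np
import pandas as pd

# Illustrative timings in seconds; in the real run these are the
# list_* variables filled by the benchmark loop above.
list_DecisionTreeClassifier = [0.029, 0.058, 0.056]
list_LogisticRegression = [0.050, 0.072, 0.076]
list_LinearSVC = [0.079, 0.124, 0.153]
list_SVC = [0.328, 0.959, 2.102]

timings = pd.DataFrame({
    'n_samples': np.linspace(10000, 30000, 3, dtype=int),
    'DecisionTreeClassifier': list_DecisionTreeClassifier,
    'LogisticRegression': list_LogisticRegression,
    'LinearSVC': list_LinearSVC,
    'SVC': list_SVC,
})
print(timings)
# timings.to_csv('timings.csv', index=False)  # optional export
```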
The results are as follows:
The first column is the number of samples, from 10k up to 1000k (one million); every sample has two features.
The whole run took 20.24 hours in total: decision tree 0.047 h, logistic regression 0.065 h, LinearSVC 0.387 h, and SVC (RBF kernel) 19.74 h.
All times are in seconds.

| n_samples (k) | DecisionTreeClassifier | LogisticRegression | LinearSVC | SVC |
|---|---|---|---|---|
| 10 | 0.02883747 | 0.049836921 | 0.078581678 | 0.327533809 |
| 20 | 0.058209357 | 0.072491098 | 0.12397223 | 0.959392353 |
| 30 | 0.055744011 | 0.076332479 | 0.152550657 | 2.102193433 |
| 40 | 0.075380437 | 0.102246571 | 0.203200135 | 3.242318279 |
| 50 | 0.096630075 | 0.12982895 | 0.268338368 | 4.742749603 |
| 60 | 0.123576192 | 0.163956345 | 0.328781705 | 6.687300061 |
| 70 | 0.168099112 | 0.213670662 | 0.441686996 | 9.074958383 |
| 80 | 0.168602809 | 0.220825094 | 0.428130358 | 11.48451644 |
| 90 | 0.211460762 | 0.272062956 | 0.523074432 | 14.64752711 |
| 100 | 0.24415954 | 0.313279468 | 0.654271114 | 17.69263482 |
| 110 | 0.244321442 | 0.325576588 | 0.778064836 | 20.76720159 |
| 120 | 0.27956278 | 0.378553588 | 0.935907076 | 49.65587547 |
| 130 | 0.32607641 | 0.460912716 | 1.16475368 | 29.29933011 |
| 140 | 0.346625301 | 0.48400057 | 1.315402189 | 34.02656613 |
| 150 | 0.372370017 | 0.586220849 | 1.584742415 | |
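As a rough sanity check on the scaling, one can fit a power law t ≈ c·n^α to the SVC column above. This is only a sketch: it uses the measured values up to 110k samples (skipping the anomalous 120k point), and since the recorded times are cumulative within each iteration, the exponent is indicative rather than exact.

```python
import numpy as np

n = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110])  # thousands of samples
t = np.array([0.327533809, 0.959392353, 2.102193433, 3.242318279,
              4.742749603, 6.687300061, 9.074958383, 11.48451644,
              14.64752711, 17.69263482, 20.76720159])          # SVC times (s)

# Fit log t = alpha * log n + log c; the slope alpha is the growth exponent.
alpha, logc = np.polyfit(np.log(n), np.log(t), 1)
print(round(alpha, 2))  # growth is roughly between n^1.5 and n^2
```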