Getting Started with Machine Learning in Python Using Sklearn: Case Studies (Part 2)
Decision Tree Algorithm
The decision tree algorithm is a method for approximating discrete-valued functions. It first processes the data, uses an inductive algorithm to generate readable rules and a decision tree, and then uses that tree to classify new data. The algorithm builds a decision tree to uncover the classification rules hidden in the data; constructing a tree that is both accurate and compact is the core problem of decision tree learning.
The Decision Tree algorithm introduced here lives in the tree module; the class is DecisionTreeClassifier, with the following interface:
DecisionTreeClassifier(criterion = 'gini',splitter = 'best',max_depth = None,min_samples_split = 2,min_samples_leaf = 1,min_weight_fraction_leaf = 0.0,max_features = None,random_state = None,max_leaf_nodes = None,min_impurity_split = 1e-07,class_weight = None,presort = False)
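As a quick sanity check of the interface above, here is a minimal self-contained sketch on the standard Iris dataset; the 75/25 split and random_state are illustrative assumptions, not taken from the original text:

```python
# Minimal DecisionTreeClassifier sketch on the Iris dataset.
# The test_size and random_state values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

mx = tree.DecisionTreeClassifier(criterion='gini', max_depth=None)
mx.fit(x_train, y_train)
print('test accuracy: {0:.2f}%'.format(100 * mx.score(x_test, y_test)))
```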
Modeling and prediction
#2
print('\n2# build model')
mx = zai.mx_dtree(x_train.values, y_train.values)
#3
print('\n3# predict')
y_pred = mx.predict(x_test.values)
df9['y_predsr'] = y_pred
df9['y_test'], df9['y_pred'] = y_test, y_pred
df9['y_pred'] = round(df9['y_predsr']).astype(int)
The mx_dtree() function
def mx_dtree(train_x, train_y):
    mx = tree.DecisionTreeClassifier()
    mx.fit(train_x, train_y)
    return mx
Save the results and display them
#4
df9.to_csv('tmp/iris_9.csv',index=False)
print('\n4# df9')
print(df9.tail())
Output
4# df9
x1 x2 x3 x4 y_predsr y_test y_pred
33 6.4 2.8 5.6 2.1 1 1 1
34 5.8 2.8 5.1 2.4 1 1 1
35 5.3 3.7 1.5 0.2 2 2 2
36 5.5 2.3 4.0 1.3 3 3 3
37 5.2 3.4 1.4 0.2 2 2 2
Check the test results
#5
dacc=zai.ai_acc_xed(df9,1,False)
print('\n5# mx:mx_sum,kok:{0:.2f}%'.format(dacc))
Output
5# mx:mx_sum,kok:97.37%
The Decision Tree result is 97.37%, the same as the KNN nearest-neighbor result.
GBDT (Gradient Boosting Decision Tree) Algorithm
GBDT, short for Gradient Boosting Decision Tree, has become popular through its wins in international AI competitions such as those hosted on Kaggle. A GBDT model is an ensemble of decision trees whose outputs are summed to produce the final prediction. It can be used for both classification and regression, and, together with SVM, it is regarded as one of the algorithms with the strongest generalization ability.
The GBDT algorithm introduced here lives in the ensemble module; the class is GradientBoostingClassifier, with the following interface:
GradientBoostingClassifier(loss = 'deviance',learning_rate = 0.1,n_estimators = 100,subsample = 1.0,criterion = 'friedman_mse',min_samples_split = 2,min_samples_leaf = 1,min_weight_fraction_leaf = 0.0,max_depth = 3,max_features = None,verbose = 0,max_leaf_nodes = None,warm_start = False,presort = 'auto')
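A minimal self-contained sketch of the interface above on the Iris dataset; n_estimators=200 mirrors the mx_GBDT() helper used in this section, while the split and random_state are illustrative assumptions:

```python
# Minimal GradientBoostingClassifier sketch on the Iris dataset.
# n_estimators=200 follows the mx_GBDT() helper; the split is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

mx = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1)
mx.fit(x_train, y_train)
print('test accuracy: {0:.2f}%'.format(100 * mx.score(x_test, y_test)))
```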
Modeling and prediction
#2
print('\n2# build model')
mx = zai.mx_GBDT(x_train.values, y_train.values)
#3
print('\n3# predict')
y_pred = mx.predict(x_test.values)
df9['y_predsr'] = y_pred
df9['y_test'], df9['y_pred'] = y_test, y_pred
df9['y_pred'] = round(df9['y_predsr']).astype(int)
The mx_GBDT() function
def mx_GBDT(train_x, train_y):
    mx = GradientBoostingClassifier(n_estimators=200)
    mx.fit(train_x, train_y)
    return mx
Save the results and display them
#4
df9.to_csv('tmp/iris_9.csv',index=False)
print('\n4# df9')
print(df9.tail())
Output
4# df9
x1 x2 x3 x4 y_predsr y_test y_pred
33 6.4 2.8 5.6 2.1 1 1 1
34 5.8 2.8 5.1 2.4 1 1 1
35 5.3 3.7 1.5 0.2 2 2 2
36 5.5 2.3 4.0 1.3 3 3 3
37 5.2 3.4 1.4 0.2 2 2 2
Check the test results
#5
dacc=zai.ai_acc_xed(df9,1,False)
print('\n5# mx:mx_sum,kok:{0:.2f}%'.format(dacc))
Output
5# mx:mx_sum,kok:97.37%
The GBDT result is 97.37%, the same as the Decision Tree and KNN results.
SVM (Support Vector Machine) Algorithm
The SVM algorithm introduced here lives in the svm module; the class is SVC, with the following interface:
SVC(C = 1.0,kernel = 'rbf',degree = 3,gamma = 'auto',coef0 = 0.0,shrinking = True,probability = False,tol = 0.001,cache_size = 200,class_weight = None,verbose = False,max_iter = -1,decision_function_shape = None,random_state = None)
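A minimal self-contained sketch of the SVC interface above on the Iris dataset; the kernel and probability settings follow this section, while the split and random_state are illustrative assumptions:

```python
# Minimal SVC sketch on the Iris dataset.
# kernel='rbf' and probability=True follow the text; the split is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

mx = SVC(kernel='rbf', gamma='auto', probability=True)
mx.fit(x_train, y_train)
print('test accuracy: {0:.2f}%'.format(100 * mx.score(x_test, y_test)))
```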
Modeling and prediction
#2
print('\n2# build model')
mx = zai.mx_svm(x_train.values, y_train.values)
#3
print('\n3# predict')
y_pred = mx.predict(x_test.values)
df9['y_predsr'] = y_pred
df9['y_test'], df9['y_pred'] = y_test, y_pred
df9['y_pred'] = round(df9['y_predsr']).astype(int)
The mx_svm() function
def mx_svm(train_x, train_y):
    mx = SVC(kernel='rbf', probability=True)
    mx.fit(train_x, train_y)
    return mx
Save the results and display them
#4
df9.to_csv('tmp/iris_9.csv',index=False)
print('\n4# df9')
print(df9.tail())
Output
4# df9
x1 x2 x3 x4 y_predsr y_test y_pred
33 6.4 2.8 5.6 2.1 1 1 1
34 5.8 2.8 5.1 2.4 1 1 1
35 5.3 3.7 1.5 0.2 2 2 2
36 5.5 2.3 4.0 1.3 3 3 3
37 5.2 3.4 1.4 0.2 2 2 2
Check the test results
#5
dacc=zai.ai_acc_xed(df9,1,False)
print('\n5# mx:mx_sum,kok:{0:.2f}%'.format(dacc))
Output
5# mx:mx_sum,kok:97.37%
The SVM result is also 97.37%.
SVM-cross (Cross-Validated SVM) Algorithm
Function definition
def mx_svm_cross(train_x, train_y):
    mx = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(mx, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    mx = SVC(kernel='rbf', C=best_parameters['C'], gamma=best_parameters['gamma'], probability=True)
    mx.fit(train_x, train_y)
    return mx
The basic idea of cross-validation is to split the original data into groups, using one part as a training set and another as a validation set. The classifier is first trained on the training set, and the resulting model is then tested on the validation set to evaluate the classifier's performance.
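The grid search inside mx_svm_cross() can also be sketched with sklearn alone. The parameter grid below is taken from the function above; the dataset, split, and random_state are illustrative assumptions:

```python
# Cross-validated grid search for SVC hyperparameters on the Iris dataset.
# param_grid follows mx_svm_cross(); the split is an illustrative assumption.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000],
              'gamma': [0.001, 0.0001]}
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid_search.fit(x_train, y_train)

print('best parameters:', grid_search.best_params_)
print('test accuracy: {0:.2f}%'.format(
    100 * grid_search.score(x_test, y_test)))
```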
Modeling and prediction
#2
print('\n2# build model')
mx = zai.mx_svm_cross(x_train.values, y_train.values)
#3
print('\n3# predict')
y_pred = mx.predict(x_test.values)
df9['y_predsr'] = y_pred
df9['y_test'], df9['y_pred'] = y_test, y_pred
df9['y_pred'] = round(df9['y_predsr']).astype(int)
Save the results and display them
#4
df9.to_csv('tmp/iris_9.csv',index=False)
print('\n4# df9')
print(df9.tail())
Output
4# df9
x1 x2 x3 x4 y_predsr y_test y_pred
33 6.4 2.8 5.6 2.1 1 1 1
34 5.8 2.8 5.1 2.4 1 1 1
35 5.3 3.7 1.5 0.2 2 2 2
36 5.5 2.3 4.0 1.3 3 3 3
37 5.2 3.4 1.4 0.2 2 2 2
Check the test results
#5
dacc=zai.ai_acc_xed(df9,1,False)
print('\n5# mx:mx_sum,kok:{0:.2f}%'.format(dacc))
Output
5# mx:mx_sum,kok:94.74%
In theory, the SVM-cross algorithm should outperform plain SVM because of the added cross-validation. The 94.74% obtained here is still reasonable: over-fitting through excessive iteration can occur, and the Iris dataset is small, only 150 samples. This also illustrates the small-data lesson:
In AI and machine learning, more data is not always better, and neither are more iterations.
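One way to probe this claim (an illustrative sketch, not from the original text) is to compare cross-validated accuracy as the number of boosting iterations grows; on a dataset as small as Iris, the score typically plateaus early rather than keep improving:

```python
# Sketch: cross-validated GBDT accuracy on Iris for increasing
# numbers of boosting iterations. The n_estimators values are
# illustrative; on 150 samples the score usually plateaus early.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)
for n in (10, 100, 500):
    mx = GradientBoostingClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(mx, x, y, cv=5)
    print('n_estimators={0:>3}: mean accuracy {1:.2f}%'.format(
        n, 100 * scores.mean()))
```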