(7) Model Tuning

Luminita_myl

Category: Notes. Tags: machine learning, classification, artificial intelligence

Article link: https://blog.csdn.net/mylnn/article/details/120790811

Model Tuning

1 Parameter estimation
2 Maximum likelihood estimation in machine learning
3 Model tuning
4 Tuning the churn-warning model

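Point estimation of random-variable parameters: method of moments, maximum likelihood estimation
Maximum likelihood estimation in statistical learning: MLE for linear regression, MLE for logistic regression
Model tuning in statistical learning: criteria for selecting the best model, internal and external validity of a model, and approaches to tuning machine learning models.

Basic problems of statistical inference: parameter estimation (point estimation, interval estimation) and hypothesis testing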

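1 Parameter estimation

Point estimation: estimate the value of an unknown parameter, using the method of moments, maximum likelihood estimation, least squares estimation, or Bayesian estimation
Interval estimation: estimate a range for the unknown parameter such that the range contains the true parameter value with a given probability
# Method of moments: first moment, second moment; sample moments are used to estimate the corresponding population moments (including central moments)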
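The difference between the estimators listed above can be illustrated with a small sketch. It assumes a made-up example that is not in the original notes: a sample from Uniform(0, theta) with theta = 5, for which the method of moments and maximum likelihood give different point estimates. The last lines add a 95% interval estimate for the population mean. All variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_true = 5.0  # assumed "unknown" parameter of Uniform(0, theta)
sample = rng.uniform(0.0, theta_true, size=10_000)

# Method of moments: E[X] = theta / 2, so match the first sample moment.
theta_mom = 2.0 * sample.mean()

# Maximum likelihood: the likelihood theta**(-n) (for theta >= max(x))
# is largest at the sample maximum.
theta_mle = sample.max()

# Interval estimate: 95% t-interval for the population mean.
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print('moment estimate :', theta_mom)
print('MLE             :', theta_mle)
print('95% CI for mean :', (ci_low, ci_high))
```

The point estimators return a single number each, while the interval estimator returns a range whose coverage probability is fixed in advance.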

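2 Maximum likelihood estimation in machine learning

Maximum likelihood estimation is an algorithm shared by statistics and machine learning
Machine learning methodology: choose an algorithm, an objective function, and a computation method
Machine learning can be understood as processing data with an algorithm to build a learned model and then evaluating that model's performance. If the model meets the requirements, it is applied to other data; if not, the algorithm is adjusted, the model is rebuilt and evaluated again, and the cycle repeats until the result is good enough to use on other data.

Continuous Y: linear regression, regression trees, neural networks
Binary Y: decision trees, logistic regression, neural networks, support vector machines, naive Bayes
Objective functions: likelihood function, entropy, loss function, hinge function (SVM)

# Machine learning emphasizes feasibility and efficiency; statistics emphasizes the rigor and applicability of its methods, and small data. Statistics is the guiding philosophy of machine learning
# Machine learning models have parameters and hyperparameters (the latter are adjusted to select the best model)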
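To make the link between the likelihood function and model fitting concrete, here is a minimal sketch of maximum likelihood estimation for logistic regression. It runs on synthetic data with assumed coefficients (`true_beta` is made up for the illustration) and minimizes the negative log-likelihood with a general-purpose optimizer rather than a dedicated library routine.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # intercept column + one feature
true_beta = np.array([-0.5, 1.5])      # assumed coefficients, for illustration only
p = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p)

def neg_log_likelihood(beta):
    """Negative Bernoulli log-likelihood with a logit link."""
    z = X @ beta
    # log(1 + exp(z)) is computed stably as logaddexp(0, z)
    return np.sum(np.logaddexp(0.0, z) - y * z)

result = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
beta_hat = result.x
print('estimated coefficients:', beta_hat)
```

Here the objective function is the likelihood and the computation method is BFGS; a different choice for either gives a different learning algorithm within the same methodology.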

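3 Model tuning

Choose the most reasonable hyperparameters according to the evaluation metrics.

Hyperparameters of machine learning algorithms:
Logistic regression: LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True)
Decision tree: DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, ...)
BP neural network: MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', ...)
SVM: SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, ...) (recent scikit-learn versions default to gamma='scale')
KNN: KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', ...)
Naive Bayes: almost no hyperparameters to tune (in scikit-learn, only a smoothing term such as var_smoothing in GaussianNB or alpha in MultinomialNB)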
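The hyperparameters listed above can be inspected directly from an estimator object. The short sketch below uses scikit-learn's `get_params()` to print the hyperparameter names and defaults of three of the models; the exact set of names depends on the installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# get_params() returns a dict of hyperparameter name -> current value
for model in (LogisticRegression(), DecisionTreeClassifier(), GaussianNB()):
    params = model.get_params()
    print(type(model).__name__, sorted(params))

lr_params = LogisticRegression().get_params()
tree_params = DecisionTreeClassifier().get_params()
nb_params = GaussianNB().get_params()
print('default C:', lr_params['C'])
```

The naive Bayes list is much shorter than the others, which is why it is usually described as having little to tune.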

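Evaluation metrics for binary classification models
Bias-variance trade-off
The more complex the model, the smaller its bias on the training set and the larger its variance on the test set
The smaller the bias, the more accurate the predictions; the smaller the variance, the more stable the predictions
Fit the model on the training set (train) and check its performance on the test set (test)
Hyperparameters can control how strongly the model is penalized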
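The trade-off described above can be seen in a small experiment. This is a sketch on synthetic data (not the churn data): decision trees of increasing depth are fitted on a training split and scored on both splits. The dataset settings, including the 20% label noise, are assumptions chosen only to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 20% randomly flipped labels
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth=%s  train acc=%.3f  test acc=%.3f'
          % (depth, results[depth][0], results[depth][1]))
```

The unrestricted tree fits the training set almost perfectly, yet the gap between its training and test accuracy is much wider than for the shallow tree. Here max_depth is the hyperparameter that limits model complexity.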

4 Tuning the churn-warning model

import os
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Load and clean the data
os.chdir(r'D:\python商业实践\《Python数据科学技术详解与商业实践》PDF+源代码+八大案例\《Python数据科学技术详解与商业实践》PDF+源代码+八大案例\源代码\Python_book\8Logistic')    
churn=pd.read_csv('telecom_churn.csv',skipinitialspace=True)

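# Association between categorical variables: posTrend & churn
## Cross-tabulation
cross_table=pd.crosstab(churn.posTrend,churn.churn,margins=True)
## Contingency table (row percentages)
def percConvert(ser):
    # divide each cell by the row total, which is the last entry ('All')
    return ser/float(ser.iloc[-1])
cross_table.apply(percConvert,axis=1)
## Chi-square test
print('''chisq=%6.6f
p-value=%6.4f
dof=%i
expected_freq=%s'''%stats.chi2_contingency(cross_table.iloc[:2,:2]))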

# Logistic regression: duration (time on the network) & churn
churn.plot(x='duration',y='churn',kind='scatter')
## Random sampling to build the training and test sets
train=churn.sample(frac=0.7,random_state=1234).copy()
test=churn[~churn.index.isin(train.index)].copy()
print('Training set size: %i \n Test set size: %i'  %(len(train),len(test)))
## Binomial() uses the logit link by default
lg=smf.glm('churn~duration',data=train,
          family=sm.families.Binomial()).fit()
lg.summary()
## Prediction
train['proba']=lg.predict(train)
test['proba']=lg.predict(test)
test['proba'].head()

# Model evaluation on the test set
## Set the threshold
test['prediction']=(test['proba']>0.3).astype('int')
## Confusion matrix
pd.crosstab(test.churn,test.prediction,margins=True)
## Compute accuracy
acc=sum(test['prediction']==test['churn'])/float(len(test))
print('The accuracy is %.2f'%acc)
## Loop over thresholds to get the metrics at each one
## (rows = predicted class, columns = actual class, churn=1 is the positive class)
for i in np.arange(0.1, 0.9, 0.1):
    prediction = (test['proba'] > i).astype('int')
    confusion_matrix = pd.crosstab(prediction,test.churn,
                                   margins = True)
    precision = confusion_matrix.loc[1, 1] /confusion_matrix.loc[1, 'All']     # TP / predicted positives
    recall = confusion_matrix.loc[1, 1] / confusion_matrix.loc['All', 1]       # TP / actual positives
    Specificity = confusion_matrix.loc[0, 0] /confusion_matrix.loc['All', 0]   # TN / actual negatives
    f1_score = 2 * (precision * recall) / (precision + recall)
    print('threshold: %.1f, precision: %.2f, recall:%.2f ,Specificity:%.2f , f1_score:%.2f'%(i, precision, recall, Specificity,f1_score))  # see Figure A

## Plot the ROC curve
import sklearn.metrics as metrics

fpr_test,tpr_test,th_test=metrics.roc_curve(test.churn,test.proba)
fpr_train,tpr_train,th_train=metrics.roc_curve(train.churn,train.proba)

plt.figure(figsize=[3,3])
plt.plot(fpr_test,tpr_test,'b--')   # test set: blue dashed line
plt.plot(fpr_train,tpr_train,'r-')  # training set: red solid line
plt.title('ROC curve')
plt.show()   # see Figure B

print('AUC=%.4f' %metrics.auc(fpr_test,tpr_test)) #AUC=0.8790
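As a cross-check on the threshold metrics used in this evaluation, the sketch below computes precision, recall, specificity and F1 from a confusion matrix on ten hand-made observations. The labels and probabilities are invented for the illustration and have nothing to do with the churn data; class 1 is treated as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.25, 0.20, 0.45, 0.70, 0.90])

threshold = 0.3
y_pred = (proba > threshold).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of actual positives that are found
specificity = tn / (tn + fp)    # share of actual negatives that are found
f1 = 2 * precision * recall / (precision + recall)

print('precision=%.2f recall=%.2f specificity=%.2f f1=%.2f'
      % (precision, recall, specificity, f1))
# prints: precision=0.60 recall=0.75 specificity=0.67 f1=0.67
```

Raising the threshold generally trades recall for precision, which is why the metrics are scanned over a range of thresholds.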

# Multiple logistic regression: forward selection
def forward_select(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates=[]
        # fit one model per remaining candidate and record its AIC
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response,' + '.join(selected + [candidate]))
            aic = smf.glm(
                formula=formula, data=data, 
                family=sm.families.Binomial()   # logit link by default
            ).fit().aic
            aic_with_candidates.append((aic, candidate))
        # sort descending so that pop() returns the candidate with the lowest AIC
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate=aic_with_candidates.pop()
        if current_score > best_new_score: 
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print ('aic is {},continuing!'.format(current_score))
        else:        
            print ('forward selection over!')
            break

    formula = "{} ~ {} ".format(response,' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = smf.glm(
        formula=formula, data=data, 
        family=sm.families.Binomial()
    ).fit()
    return model

candidates=['churn','duration','AGE','edu_class','posTrend','negTrend','nrProm','prom','curPlan','avgplan','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']
data_for_select=train[candidates]
lg_m1=forward_select(data=data_for_select,response='churn')
lg_m1.summary()

# Check for collinearity (VIF = 1 / (1 - R^2) from regressing each x on the other x variables)
def vif(df,col_i):
    from statsmodels.formula.api import ols
    cols=list(df.columns)
    cols.remove(col_i)
    cols_noti=cols
    formula=col_i+'~'+'+'.join(cols_noti)
    r2=ols(formula,df).fit().rsquared
    return 1./(1.-r2)

exog=train[candidates].drop(['churn'],axis=1)
for i in exog.columns:
    print(i,'\t',vif(df=exog,col_i=i))

# A VIF above 10 indicates collinearity; after screening, three x variables were dropped and collinearity was checked again

Figure A
Figure B

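Build the model with ridge regression and the Lasso algorithm, using cross-validation to determine the penalty parameter (the value of C)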

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

candidates=['duration','AGE','edu_class','posTrend','negTrend','prom','curPlan','planChange','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']

scaler=StandardScaler()   # standardize the features
x=scaler.fit_transform(churn[candidates])
y=churn['churn']
# Build the search space for C
from sklearn import linear_model
from sklearn.svm import l1_min_c
cs=l1_min_c(x,y,loss='log')*np.logspace(0,4)  

# Fit logistic regression with an l1 penalty for each value of C
print('Computing regularization path...')
clf=linear_model.LogisticRegression(C=1.0,penalty='l1',tol=1e-6,solver='liblinear')  
coefs_=[]
for c in cs:
    clf.set_params(C=c)  # set the inverse regularization strength
    clf.fit(x,y)
    coefs_.append(clf.coef_.ravel().copy())  

coefs_=np.array(coefs_)
plt.plot(np.log10(cs),coefs_)
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()  ## see Figure C
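What the regularization path shows can also be summarized numerically: as C shrinks, the L1 penalty grows and more coefficients become exactly zero. The sketch below counts non-zero coefficients for a few values of C on synthetic data; the data and the grid of C values are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features out of 10
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

Cs = [0.001, 0.01, 0.1, 1.0, 10.0]
nonzero_counts = []
for C in Cs:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    nonzero_counts.append(int(np.count_nonzero(clf.coef_)))
    print('C=%-6s non-zero coefficients: %d' % (C, nonzero_counts[-1]))
```

Because C is the inverse of the regularization strength, the left end of the path plot (small C) is where the model is sparsest.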

cs=l1_min_c(x,y,loss='log')*np.logspace(0,4)
from sklearn.model_selection import cross_val_score    # k-fold cross-validation

k_scores=[]
clf=linear_model.LogisticRegression(penalty='l1',solver='liblinear')
# Loop over the candidate values of C and record the cross-validated mean and standard deviation of the AUC
for c in cs:
    clf.set_params(C=c)
    scores=cross_val_score(clf,x,y,cv=10,scoring='roc_auc')     # a higher mean and a lower standard deviation are better
    k_scores.append([c,scores.mean(),scores.std()])
  
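# Visualize the results
data=pd.DataFrame(k_scores)  # convert the list of [C, mean, std] rows into a DataFrame
fig=plt.figure()

ax1=fig.add_subplot(111)
ax1.plot(np.log10(data[0]),data[1],'b')
ax1.set_ylabel('Mean ROC(Blue)')
ax2=ax1.twinx()
ax2.plot(np.log10(data[0]),data[2],'r')
ax2.set_ylabel('Std ROC Index(Red)')    # see Figure D; a reasonable value of C was taken to be np.exp(-1.9) (note that the x-axis here is log10(C), not the natural log)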
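Reading the best C off a plot can be complemented by picking it programmatically. The sketch below repeats the same cross-validation idea on synthetic data and selects the C with the highest mean AUC. The dataset, the grid, and the use of a pipeline are assumptions of this sketch rather than the method used in these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

Cs = np.logspace(-2, 2, 9)
rows = []
for C in Cs:
    # Scaling inside the pipeline is refitted on each training fold
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty='l1', solver='liblinear', C=C))
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    rows.append((C, scores.mean(), scores.std()))

# Pick the C with the highest mean cross-validated AUC
best_C, best_mean, best_std = max(rows, key=lambda r: r[1])
print('best C=%.4g  mean AUC=%.4f  std=%.4f' % (best_C, best_mean, best_std))
```

Putting the scaler in the pipeline keeps the held-out fold out of the standardization step, which scaling the full dataset up front does not.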

# Plug in the chosen hyperparameter C and refit the Lasso (L1-penalized) model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
candidates=['duration','AGE','edu_class','posTrend','negTrend','nrProm','prom','curPlan','avgplan','planChange','incomeCode','feton','peakMinAv','peakMinDiff','call_10086']
scaler=StandardScaler() # standardize the features
x=scaler.fit_transform(churn[candidates])
y=churn['churn'] 

from sklearn import linear_model
clf=linear_model.LogisticRegression(C=np.exp(-1.9),penalty='l1',solver='liblinear')
clf.fit(x,y)
clf.coef_   # see Figure E

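The Lasso dropped two x variables, whereas stepwise regression plus VIF dropped three. Machine learning does not necessarily find the best model, but it can find one that is about as good.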

Figure C
Figure D
Figure E

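Problems:

When building the logistic regression, choosing the 'l1' regularization penalty raised an error; the fix is to add the solver parameter, solver='liblinear'. This parameter selects the optimization algorithm: the 'newton-cg', 'sag' and 'lbfgs' solvers support only 'l2' regularization, while the 'liblinear' solver supports both 'l1' and 'l2'; with dual=True only the l2 penalty is supported. Two parameters therefore constrain the choice of penalty, dual and solver: to use the L1 norm, dual must be False (the default) and solver must be 'liblinear' (newer scikit-learn versions also accept 'saga').
The k-fold cross-validation function lives in the model_selection package: from sklearn.model_selection import cross_val_score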
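The solver issue above can be reproduced in a few lines. This sketch, on synthetic data, assumes a scikit-learn version in which the 'lbfgs' solver rejects an L1 penalty with a ValueError; the exact message and any deprecation warnings vary between releases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 'lbfgs' does not support the L1 penalty, so fitting is expected to fail
try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)
    lbfgs_error = None
except ValueError as exc:
    lbfgs_error = str(exc)
print('lbfgs + l1 ->', lbfgs_error)

# 'liblinear' supports both L1 and L2 penalties
clf = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)
print('liblinear + l1 coefficients:', clf.coef_)
```

Setting the solver explicitly whenever the penalty is changed avoids depending on the library's default, which has differed across versions.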
