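I've hit far too many pitfalls lately, so I'm going to write a few posts to turn my luck around. To kick things off, here are a few lines of code I've found useful.
1. Persistence
No preamble, straight to the code.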
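import joblib

# xgb native model file, suitable for cross-platform use
model.save_model('xgbclf.model')
# File produced through the sklearn wrapper (a pickle of the Python object); not the same as the xgb native file
# The two calls are alternatives: pick a different filename if you run both, or the second overwrites the first
joblib.dump(model, 'xgbclf.model')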
Note: this post uses the sklearn API. If you use the native API, the code runs after a few small changes.
2. Interpretability
Human-readable xgb trees
model.get_booster().dump_model('xgbclf.text')
Analyzable xgb trees
Note: requires version 0.82 or later
xgbtree = model.get_booster().trees_to_dataframe()
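Once the trees are in a DataFrame, ordinary pandas is enough to analyze them. A minimal sketch, using a hand-made toy frame with the same columns that `trees_to_dataframe()` returns (every value below is invented, and on `Leaf` rows the `Gain` column holds the leaf value rather than a split gain):

```python
import pandas as pd

# Toy stand-in for model.get_booster().trees_to_dataframe(); values are made up.
xgbtree = pd.DataFrame({
    "Tree": [0, 0, 0, 1, 1, 1],
    "Node": [0, 1, 2, 0, 1, 2],
    "ID": ["0-0", "0-1", "0-2", "1-0", "1-1", "1-2"],
    "Feature": ["f0", "Leaf", "Leaf", "f1", "Leaf", "Leaf"],
    "Split": [0.5, None, None, 1.5, None, None],
    "Yes": ["0-1", None, None, "1-1", None, None],
    "No": ["0-2", None, None, "1-2", None, None],
    "Missing": ["0-1", None, None, "1-1", None, None],
    "Gain": [10.0, -0.2, 0.3, 4.0, 0.1, -0.1],
    "Cover": [100.0, 40.0, 60.0, 100.0, 50.0, 50.0],
})

# Split nodes only: how often each feature is used and how much gain it brings.
splits = xgbtree[xgbtree["Feature"] != "Leaf"]
summary = (
    splits.groupby("Feature")["Gain"]
    .agg(["count", "sum", "mean"])
    .sort_values("sum", ascending=False)
)

# Number of leaves in each tree, a rough measure of tree size.
leaves_per_tree = xgbtree[xgbtree["Feature"] == "Leaf"].groupby("Tree").size()

print(summary)
print(leaves_per_tree)
```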
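Single-sample analysis
The model's feature importance, plus a single sample's path through the trees and its own feature importance, all come out of the functions below. The sample's predicted probability can be derived from the leaf rows of that path.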
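import numpy as np
import pandas as pd
import xgboost as xgb


def _xgb_tree_leaf_parse(xgbtree, nodeid_leaf):
    '''Given leaf node IDs, trace each one's path back to the root of its tree.'''
    leaf_ind = list(nodeid_leaf)
    # .assign returns a copy, which avoids pandas' SettingWithCopyWarning
    result = xgbtree.loc[xgbtree.ID.isin(leaf_ind), :].assign(Tag='Leaf')
    node_id = list(result.ID)
    while len(node_id) > 0:
        # Parents whose Yes / No branch leads to a node already on the path
        tmp1 = xgbtree.loc[xgbtree.Yes.isin(node_id), :].assign(Tag='Yes')
        tmp2 = xgbtree.loc[xgbtree.No.isin(node_id), :].assign(Tag='No')
        node_id = list(tmp1.ID) + list(tmp2.ID)
        result = pd.concat([result, tmp1, tmp2], axis=0)
    return result


def xgb_parse(model, feature=None):
    '''Given a model and one sample (a pandas Series indexed by feature name),
    return the sample's path through the xgb trees and its feature importance.
    Without a sample, return the full tree table and the model's feature importance.
    '''
    feature_names = model.get_booster().feature_names
    # missing_value = model.get_params()['missing']
    f0 = pd.DataFrame({'GainTotal': model.feature_importances_, 'Feature': feature_names})
    f0 = f0[['Feature', 'GainTotal']]
    xgbtree = model.get_booster().trees_to_dataframe()
    if feature is None:
        return xgbtree, f0.sort_values(by='GainTotal', ascending=False).reset_index(drop=True)
    # DMatrix treats a Series as a single column, so pass the sample as a one-row DataFrame
    dmat = xgb.DMatrix(feature.to_frame().T)
    ind = model.get_booster().predict(dmat, validate_features=False, pred_leaf=True)[0]
    # Leaf IDs in 'tree-node' form (np.str no longer exists in NumPy, so build plain strings)
    ind = [f'{tree}-{int(leaf)}' for tree, leaf in enumerate(ind)]
    result = _xgb_tree_leaf_parse(xgbtree, ind)
    result = result.sort_values(by=['Tree', 'Node'])
    loc = result.columns.get_loc('Feature') + 1
    # The sample's value for each split feature; leaf rows keep the label 'Leaf'
    result.insert(loc, 'FeatureValue', result.Feature.map(lambda name: feature.get(name, name)))
    # result.loc[(result.FeatureValue == missing_value) | (result.FeatureValue.isnull()), 'Tag'] = 'Missing'
    result = result[['Tree', 'Node', 'ID', 'Feature', 'FeatureValue', 'Split', 'Tag', 'Yes', 'No', 'Missing', 'Gain', 'Cover']]
    # Per-sample importance: mean gain per feature along the path, normalized to sum to 1
    f_ = result.groupby('Feature')['Gain'].mean().drop('Leaf', axis=0)
    f_ = f_ / f_.sum()
    f_ = pd.DataFrame(f_).reset_index()
    f = pd.merge(f0[['Feature', 'GainTotal']], f_[['Feature', 'Gain']], on='Feature', how='left').fillna(0)
    # Relative difference between the sample-level and the model-level importance
    f['diff'] = np.round((f['Gain'] - f['GainTotal']) / f['GainTotal'], 2)
    f = f[['Feature', 'Gain', 'GainTotal', 'diff']].sort_values(by='Gain', ascending=False).reset_index(drop=True)
    return result, f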
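As for the predicted probability: in the path table returned by `xgb_parse`, the rows tagged `Leaf` carry the leaf value in their `Gain` column. A rough sketch of turning those values into a probability follows. It assumes a `binary:logistic` objective with the ordinary tree booster, and the helper name `proba_from_leaf_values` and the `base_score` default of 0.5 are my own assumptions (newer xgboost versions may estimate `base_score` from the data, so check your model's actual value).

```python
import math


def proba_from_leaf_values(leaf_values, base_score=0.5):
    """Hypothetical helper: probability from per-tree leaf values.

    Assumes binary:logistic, where the raw margin is the sum of the leaf
    values plus the logit of base_score, and the probability is its sigmoid.
    """
    margin = sum(leaf_values) + math.log(base_score / (1.0 - base_score))
    return 1.0 / (1.0 + math.exp(-margin))


# Invented leaf values, one per tree, as they would appear in the Leaf rows.
leaf_values = [0.3, -0.1, 0.2]
print(proba_from_leaf_values(leaf_values))
```

With a real model, the input would be something like `result.loc[result.Tag == 'Leaf', 'Gain']`, and the output should be compared against `model.predict_proba` before trusting it.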
P.S. Haha, I never expected to qualify for monetizing my traffic. It's just that the in-article ads really hurt the reading experience...