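3 Feature Construction
Learning objectives
- Understand what future information is and how to handle it
- Master methods for constructing new features from raw data
- Master feature transformation methods
- Master methods for handling missing values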
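1 Data Preparation
1.1 Sorting out the internal logic of the data
Relationship types
- One-to-one: a user has one registered phone number
- One-to-many: a user has multiple loans
- Many-to-many: a user can log in on multiple devices, and a device can be logged into by multiple users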
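The relationship type between two keys can be checked on the data itself before any join is written. The sketch below is only an illustration: the `logins` table, its column names, and the helper `relationship_type` are assumptions, not part of the original data. It counts the distinct partners on each side of a link table and classifies the relationship accordingly.

```python
import pandas as pd

# Hypothetical login records: which user logged in on which device.
logins = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2", "u3", "u3"],
    "device_id": ["d1", "d2", "d2", "d3", "d3"],
})

def relationship_type(df, left, right):
    """Classify the relationship between two key columns of a link table."""
    pairs = df[[left, right]].drop_duplicates()
    # Largest number of distinct partners seen for any single key on each side.
    max_right_per_left = pairs.groupby(left)[right].nunique().max()
    max_left_per_right = pairs.groupby(right)[left].nunique().max()
    if max_right_per_left == 1 and max_left_per_right == 1:
        return "one-to-one"
    if max_right_per_left > 1 and max_left_per_right > 1:
        return "many-to-many"
    # Covers both one-to-many and many-to-one.
    return "one-to-many"

kind = relationship_type(logins, "user_id", "device_id")
print(kind)
```

Knowing the relationship type in advance helps explain why a join multiplies rows: joining across a one-to-many or many-to-many key repeats the rows of the "one" side.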
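Example
In the figure below, the blue box marks the current billing statement for February and the red boxes mark orders
Drawing an ER-style diagram
- Task: analyze the delinquency of first-order users who have thick data and log in frequently
- The table structure can be included in the feature document to explain the data-extraction logic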
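1.2 Sample design and feature framework
- Define the observation-period samples
  - Determine the observation period (the time slice that fixes X) and the performance period (which fixes the label Y)
  - Confirm that the number of samples is reasonable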
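The split between observation period and performance period can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `events` table, the observation date, the six-month performance period, and the "30 or more overdue days means bad" rule are all hypothetical choices, not values from the course data.

```python
import pandas as pd

# Hypothetical repayment log: one row per repayment event.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "event_date": pd.to_datetime(
        ["2023-01-10", "2023-08-05", "2023-03-15", "2023-09-20", "2023-05-01"]),
    "overdue_days": [0, 0, 3, 45, 0],
})

observation_point = pd.Timestamp("2023-06-30")                  # time slice that fixes X
performance_end = observation_point + pd.DateOffset(months=6)   # end of the period that fixes Y

# X side: only events on or before the observation point may feed features.
observed = events[events["event_date"] <= observation_point]
# Y side: events after the observation point, up to the end of the performance period.
performed = events[(events["event_date"] > observation_point)
                   & (events["event_date"] <= performance_end)]

# Assumed label rule: bad (1) if any event in the performance period is 30+ days overdue.
max_overdue = performed.groupby("user_id")["overdue_days"].max()
samples = pd.DataFrame({"user_id": observed["user_id"].unique()})
samples["label"] = (samples["user_id"].map(max_overdue).fillna(0) >= 30).astype(int)

# Check whether the sample count and the bad rate look reasonable.
print(len(samples), samples["label"].mean())
```

Keeping the two filters separate makes it easy to verify that no row used for X is dated after the observation point.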
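- Data EDA
  - Look at the overall distribution of the data
    - data.shape
    - data.isnull()
    - data.info()
    - data.describe()
  - Look at how the distributions of good and bad samples differ (`label` holds the name of the label column)
    - data[data[label]==1].describe() (bad users)
    - data[data[label]==0].describe() (good users)
  - Look at individual records
    - data.sample(n=10,random_state=1)
- Lay out the feature framework
  - Generate new features with RFM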
  - Example: user bill repayment features in a behavior scorecard
    - Key information in a user's bill: time, amount, repayment, credit limit
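As a rough illustration of how RFM (Recency, Frequency, Monetary) maps onto the four bill dimensions, the sketch below aggregates a hypothetical `bills` table per user. The table, its column names, and the observation date are assumptions made for the example, and the chosen aggregations are only one possible set.

```python
import pandas as pd

# Hypothetical bill table; column names are illustrative only.
bills = pd.DataFrame({
    "user_id":       ["u1", "u1", "u1", "u2", "u2"],
    "bill_date":     pd.to_datetime(["2023-03-01", "2023-04-01", "2023-05-01",
                                     "2023-02-01", "2023-05-01"]),
    "bill_amount":   [1000.0, 1500.0, 500.0, 3000.0, 2000.0],
    "repaid_amount": [1000.0, 1200.0, 500.0, 3000.0, 1000.0],
    "credit_limit":  [5000.0, 5000.0, 5000.0, 4000.0, 4000.0],
})
observation_point = pd.Timestamp("2023-06-01")

# Use only bills dated before the observation point.
hist = bills[bills["bill_date"] < observation_point].copy()
hist["days_ago"] = (observation_point - hist["bill_date"]).dt.days
hist["repay_ratio"] = hist["repaid_amount"] / hist["bill_amount"]
hist["utilization"] = hist["bill_amount"] / hist["credit_limit"]

rfm = hist.groupby("user_id").agg(
    recency_days=("days_ago", "min"),         # R: days since the most recent bill
    bill_cnt=("bill_date", "count"),          # F: number of bills
    bill_amount_sum=("bill_amount", "sum"),   # M: total billed amount
    repay_ratio_avg=("repay_ratio", "mean"),  # repayment dimension
    utilization_max=("utilization", "max"),   # credit-limit dimension
)
print(rfm)
```

Each row of the feature framework table then corresponds to one (dimension, aggregation) pair such as the columns above.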
- Summary: before constructing features, complete the following
  - ER-style diagram
  - Sample design table
  - Feature framework table
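2 Feature Construction
2.1 Static information features and time cross-section features
- User static information features
  - Basic user information (does not change within half a year)
- User time cross-section features
  - Future information: data generated after the current time cross-section
  - Take care when extracting time cross-section data, and avoid using future information
  - The most direct cause of future information: a missing snapshot table
    - Snapshot table: takes a "photo" once a day, for example backing up that day's data at 23:00
      - A snapshot table only keeps the final state of each day
    - Log table: every operation adds one record; rows are only inserted, never updated
    - In principle, finance-related data needs snapshot tables that record every trace (credit-limit changes, approvals and rejections across multiple applications, ...)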
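  - Possible reasons for a missing snapshot table
    - Snapshot tables consume a lot of resources, so they are skipped for performance reasons
    - The designers of the original tables overlooked them
    - Data borrowed from another business line (such as e-commerce) is used for credit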
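Because a log table is insert-only, the state as of any past moment can be rebuilt from it by discarding later rows. The sketch below shows the idea on a hypothetical `limit_log` table of credit-limit changes; the table, its columns, and the helper `state_as_of` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical append-only log of credit-limit changes (rows are inserted, never updated).
limit_log = pd.DataFrame({
    "user_id":      ["u1", "u1", "u2", "u1"],
    "changed_at":   pd.to_datetime(["2023-01-05", "2023-03-10", "2023-02-01", "2023-06-20"]),
    "credit_limit": [3000, 5000, 4000, 8000],
})

def state_as_of(log, as_of):
    """Return each user's latest record at or before `as_of`, ignoring later rows."""
    past = log[log["changed_at"] <= as_of]
    return (past.sort_values("changed_at")
                .groupby("user_id", as_index=False)
                .last())

snapshot = state_as_of(limit_log, pd.Timestamp("2023-04-01"))
print(snapshot)
```

A table that only stores the current value (8000 for u1 here) cannot answer the same question, which is how future information enters a feature.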
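- Example

Timeline: first loan → second loan → crawler authorization → third loan

Status at the time of each loan:

| User | Loan | Crawler authorized | Overdue |
|---|---|---|---|
| u1 | l11 | N | 0 |
| u1 | l12 | N | 0 |
| u1 | l13 | Y | 0 |
| u2 | l21 | N | 0 |
| u2 | l22 | N | 0 |
| u2 | l23 | Y | 1 |
| u3 | l31 | N | 0 |
| u3 | l32 | N | 0 |
| u3 | l33 | Y | 0 |

What is actually stored:

| User | Authorized |
|---|---|
| u1 | Y |
| u2 | Y |
| u3 | Y |

| User | Loan | Overdue |
|---|---|---|
| u1 | l11 | 0 |
| u1 | l12 | 0 |
| u1 | l13 | 0 |
| u2 | l21 | 0 |
| u2 | l22 | 0 |
| u2 | l23 | 1 |
| u3 | l31 | 0 |
| u3 | l32 | 0 |
| u3 | l33 | 0 |

Join result:

| User | Loan | Crawler authorized | Overdue |
|---|---|---|---|
| u1 | l11 | Y | 0 |
| u1 | l12 | Y | 0 |
| u1 | l13 | Y | 0 |
| u2 | l21 | Y | 0 |
| u2 | l22 | Y | 0 |
| u2 | l23 | Y | 1 |
| u3 | l31 | Y | 0 |
| u3 | l32 | Y | 0 |
| u3 | l33 | Y | 0 |

After the join, the first two loans of every user also show the authorization as Y, although it was granted only later: this is future information.

Solution: store a snapshot, that is, record the time together with the authorization

| User | Authorized | Time |
|---|---|---|
| u1 | Y | t3 |
| u2 | Y | t3 |
| u3 | Y | t3 |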
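The leak in the example can be reproduced with a plain join, and removed once the authorization time is stored. The sketch below uses a subset of the example's users and numbers the four timeline events 1 to 4 (loans at 1, 2 and 4, authorization at 3); this numeric encoding and the column names are assumptions made for illustration.

```python
import pandas as pd

# Loans for two of the users; loan_time is the position of the loan on the timeline.
loans = pd.DataFrame({
    "user":      ["u1", "u1", "u1", "u2", "u2", "u2"],
    "loan":      ["l11", "l12", "l13", "l21", "l22", "l23"],
    "loan_time": [1, 2, 4, 1, 2, 4],
    "overdue":   [0, 0, 0, 0, 0, 1],
})
# Authorization table with the time stored alongside the flag.
auth = pd.DataFrame({"user": ["u1", "u2"], "authorized": ["Y", "Y"], "auth_time": [3, 3]})

# Naive join: every loan inherits the user's latest status, which leaks the future.
leaky = loans.merge(auth[["user", "authorized"]], on="user", how="left")

# Time-aware join: the flag is Y only if the authorization already existed at loan time.
safe = loans.merge(auth, on="user", how="left")
safe["authorized"] = (safe["auth_time"] <= safe["loan_time"]).map({True: "Y", False: "N"})

print(leaky["authorized"].tolist())
print(safe["authorized"].tolist())
```

The naive join marks all six loans as authorized, while the time-aware version restores N for the first two loans of each user.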
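2.2 Time series features
User time series features
- Data traced back over a period of time before the observation point
Deriving time series features
- Feature aggregation: aggregate the values that a single feature takes at multiple time points. Feature aggregation is the main feature construction method in traditional scorecard modeling.
- Example: compute each user's credit utilization rate, denote it as feature ft, and unfold it along the time axis in monthly slices
  - Credit utilization within 30 days before the application: ft1
  - Credit utilization from 30 to 60 days before the application: ft2
  - Credit utilization from 60 to 90 days before the application: ft3
  - ...
  - Credit utilization from 330 to 360 days before the application: ft12
  - This gives 12 features per user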
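One way to produce the ft1 to ft12 columns is to bucket raw records into 30-day windows and pivot them. The sketch below is a minimal illustration: the `records` table, its column names, the exact window boundaries (days 0-29 map to ft1), and the use of the mean inside each window are all assumptions.

```python
import pandas as pd

# Hypothetical records: utilization observed a number of days before the application.
records = pd.DataFrame({
    "customer_id":       [111, 111, 111, 112, 112],
    "days_before_apply": [5, 20, 45, 10, 340],
    "utilization":       [0.2, 0.4, 0.6, 0.5, 0.9],
})

# Map each record to a 30-day window: days 0-29 -> 1, 30-59 -> 2, ..., 330-359 -> 12.
records["window"] = records["days_before_apply"] // 30 + 1
records = records[records["window"].between(1, 12)]

# Average utilization per customer and window, one column per window.
wide = (records.pivot_table(index="customer_id", columns="window",
                            values="utilization", aggfunc="mean")
               .reindex(columns=range(1, 13)))   # keep all 12 windows, even empty ones
wide.columns = [f"ft{m}" for m in wide.columns]
wide = wide.reset_index()
print(wide)
```

Windows without any record stay NaN, which matches the missing values seen in wide monthly tables of this kind.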
```python
import pandas as pd
import numpy as np

data = pd.read_excel('data/textdata.xlsx')
data.head()
```
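Output

| | customer_id | ft1 | ft2 | ft3 | ft4 | ft5 | ft6 | ft7 | ft8 | ft9 | ... | gt3 | gt4 | gt5 | gt6 | gt7 | gt8 | gt9 | gt10 | gt11 | gt12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111 | 9 | 11.0 | 12 | 13 | 18 | 10 | 12 | NaN | NaN | ... | 10 | 0 | 18 | 10 | 12 | NaN | NaN | NaN | NaN | NaN |
| 1 | 112 | 11 | -11.0 | 10 | 10 | 13 | 13 | 10 | NaN | NaN | ... | 10 | 10 | 13 | 13 | 10 | NaN | NaN | NaN | NaN | NaN |
| 2 | 113 | 0 | 11.0 | 10 | 12 | 6 | 10 | 0 | 25.0 | 10.0 | ... | 10 | 12 | 6 | 10 | 0 | 25.0 | 10.0 | NaN | NaN | NaN |
| 3 | 114 | -7 | -1.0 | 9 | 8 | 7 | 0 | -19 | 10.0 | 11.0 | ... | 10 | 10 | 12 | 0 | -19 | 10.0 | 11.0 | NaN | NaN | NaN |
| 4 | 115 | 11 | NaN | 6 | 10 | 0 | 17 | 19 | 10.0 | 30.0 | ... | 6 | 10 | 0 | 17 | 19 | 10.0 | 30.0 | 15.0 | NaN | NaN |

5 rows × 26 columns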
- Experience-based, hand-crafted features can be derived from this time series, for example the number of months in the last p months in which the feature is greater than 0

```python
# Number of months in the last p months with ft > 0
def Num(ft, p):
    # ft: feature name prefix; p: number of most recent months to look at
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.where(df > 0, 1, 0).sum(axis=1)
    return ft + '_num' + str(p), auto_value
```
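To make the counting rule concrete, here is a small worked example. It is a sketch under assumptions: `toy` copies ft1 to ft4 from the first three rows of the displayed result, and `num_positive` is a hypothetical variant of `Num` that takes the DataFrame as a parameter instead of reading the global `data`.

```python
import pandas as pd

# Toy data: ft1..ft4 of customers 111, 112 and 113 from the displayed result.
toy = pd.DataFrame({
    "customer_id": [111, 112, 113],
    "ft1": [9, 11, 0],
    "ft2": [11.0, -11.0, 11.0],
    "ft3": [12, 10, 10],
    "ft4": [13, 10, 12],
})

def num_positive(frame, ft, p):
    """Count, per row, the months among the last p with ft > 0 (NaN is not counted)."""
    window = frame.loc[:, f"{ft}1":f"{ft}{p}"]   # label slice, both ends inclusive
    return f"{ft}_num{p}", (window > 0).sum(axis=1).to_numpy()

name, values = num_positive(toy, "ft", 3)
toy[name] = values
print(toy[["customer_id", name]])
```

Note that the label slice relies on the ft columns being stored next to each other and in order, as they are in the displayed result.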
- Number of months in the last p months in which feature ft equals 0

```python
# Number of months in the last p months with ft == 0
def zero_cnt(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.where(df == 0, 1, 0).sum(axis=1)
    return ft + '_zero_cnt' + str(p), auto_value
```
- Whether feature ft is greater than 0 in at least one of the last p months

```python
# 1 if ft > 0 in at least one of the last p months, otherwise 0
def Evr(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    arr = np.where(df > 0, 1, 0).sum(axis=1)
    auto_value = np.where(arr, 1, 0)
    return ft + '_evr' + str(p), auto_value
```
- Mean of feature ft over the last p months

```python
# Mean of ft over the last p months (NaN months are ignored)
def Avg(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanmean(df, axis=1)
    return ft + '_avg' + str(p), auto_value
```
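The choice of `np.nanmean` matters when some months have no data. The short sketch below, on made-up values, contrasts skipping missing months with treating them as 0; which behavior is appropriate depends on what a missing month means in the business context, so this is an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up values: the first customer has no data for month 2.
toy = pd.DataFrame({"ft1": [11.0, 9.0], "ft2": [np.nan, 11.0], "ft3": [6.0, 12.0]})
window = toy.loc[:, "ft1":"ft3"]

avg_skip_nan = np.nanmean(window, axis=1)         # missing months are ignored
avg_nan_as_zero = window.fillna(0).mean(axis=1)   # missing months are counted as 0

print(avg_skip_nan)
print(avg_nan_as_zero.to_numpy())
```

For the first customer the two averages differ (two observed months versus three months with one zero), while they agree for the second customer, who has no missing month.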
- Sum, maximum and minimum of feature ft over the last p months

```python
# Sum of ft over the last p months (NaN months are ignored)
def Tot(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nansum(df, axis=1)
    return ft + '_tot' + str(p), auto_value

# Sum of ft over months 2 to p+1
def Tot2T(ft, p):
    df = data.loc[:, ft + '2':ft + str(p + 1)]
    auto_value = df.sum(axis=1)
    return ft + '_tot2t' + str(p), auto_value

# Maximum of ft over the last p months
def Max(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanmax(df, axis=1)
    return ft + '_max' + str(p), auto_value

# Minimum of ft over the last p months
def Min(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanmin(df, axis=1)
    return ft + '_min' + str(p), auto_value
```
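Since every derivation function returns a `(feature_name, values)` pair, a feature table can be assembled with a simple loop over functions and window lengths. The sketch below is illustrative only: `toy`, `tot` and `max_` are hypothetical stand-ins that take the DataFrame as a parameter so the example runs on its own.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_id": [111, 112],
    "ft1": [9.0, 11.0], "ft2": [11.0, -11.0], "ft3": [12.0, 10.0],
})

# Each hypothetical helper returns (feature_name, values), mirroring the functions above.
def tot(frame, ft, p):
    return f"{ft}_tot{p}", np.nansum(frame.loc[:, f"{ft}1":f"{ft}{p}"], axis=1)

def max_(frame, ft, p):
    return f"{ft}_max{p}", np.nanmax(frame.loc[:, f"{ft}1":f"{ft}{p}"], axis=1)

# Apply every function to every window length and collect the results as columns.
features = toy[["customer_id"]].copy()
for func in (tot, max_):
    for p in (2, 3):
        name, values = func(toy, "ft", p)
        features[name] = values

print(features)
```

Encoding the function and the window length in the column name, as the original functions do, keeps the resulting table self-describing.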
- Other derivation methods

```python
# Months since the most recent month with ft > 0, within the last p months
# (1 = the latest month, 0 = no month with ft > 0)
def Msg(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    df_value = np.where(df > 0, 1, 0)
    auto_value = []
    for i in range(len(df_value)):
        row_value = df_value[i, :]
        if row_value.max() <= 0:
            indexs = 0
            auto_value.append(indexs)
        else:
            indexs = 1
            for j in row_value:
                if j > 0:
                    break
                indexs += 1
            auto_value.append(indexs)
    return ft + '_msg' + str(p), auto_value

# Months since the most recent month with ft == 0, within the last p months
# (1 = the latest month, 0 = no month with ft == 0)
def Msz(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    df_value = np.where(df == 0, 1, 0)
    auto_value = []
    for i in range(len(df_value)):
        row_value = df_value[i, :]
        if row_value.max() <= 0:
            indexs = 0
            auto_value.append(indexs)
        else:
            indexs = 1
            for j in row_value:
                if j > 0:
                    break
                indexs += 1
            auto_value.append(indexs)
    return ft + '_msz' + str(p), auto_value

# Current-month ft / mean of ft over the last p months
def Cav(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = df[ft + '1'] / np.nanmean(df, axis=1)
    return ft + '_cav' + str(p), auto_value

# Current-month ft / minimum of ft over the last p months
def Cmn(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = df[ft + '1'] / np.nanmin(df, axis=1)
    return ft + '_cmn' + str(p), auto_value

# Largest month-over-month increase of ft within the last p months (needs p >= 2)
def Mai(ft, p):
    arr = np.array(data.loc[:, ft + '1':ft + str(p)])
    auto_value = []
    for i in range(len(arr)):
        df_value = arr[i, :]
        value_lst = []
        for k in range(len(df_value) - 1):
            # Column k is the more recent month, column k+1 the month before it.
            minus = df_value[k] - df_value[k + 1]
            value_lst.append(minus)
        auto_value.append(np.nanmax(value_lst))
    return ft + '_mai' + str(p), auto_value

# Largest month-over-month decrease of ft within the last p months (needs p >= 2)
def Mad(ft, p):
    arr = np.array(data.loc[:, ft + '1':ft + str(p)])
    auto_value = []
    for i in range(len(arr)):
        df_value = arr[i, :]
        value_lst = []
        for k in range(len(df_value) - 1):
            minus = df_value[k + 1] - df_value[k]
            value_lst.append(minus)
        auto_value.append(np.nanmax(value_lst))
    return ft + '_mad' + str(p), auto_value

# Standard deviation of ft over the last p months
def Std(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanstd(df, axis=1)
    return ft + '_std' + str(p), auto_value

# Coefficient of variation of ft over the last p months (std / mean)
def Cva(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanstd(df, axis=1) / (np.nanmean(df, axis=1) + 1e-10)
    return ft + '_cva' + str(p), auto_value

# Current-month ft - mean of ft over the last p months
def Cmm(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = df[ft + '1'] - np.nanmean(df, axis=1)
    return ft + '_cmm' + str(p), auto_value

# Current-month ft - minimum of ft over the last p months
def Cnm(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = df[ft + '1'] - np.nanmin(df, axis=1)
    return ft + '_cnm' + str(p), auto_value

# Current-month ft - maximum of ft over the last p months
def Cxm(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = df[ft + '1'] - np.nanmax(df, axis=1)
    return ft + '_cxm' + str(p), auto_value

# (Current-month ft - maximum of ft over the last p months) / that maximum
def Cxp(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    temp = np.nanmax(df, axis=1)
    auto_value = (df[ft + '1'] - temp) / temp
    return ft + '_cxp' + str(p), auto_value

# Range (max - min) of ft over the last p months
def Ran(ft, p):
    df = data.loc[:, ft + '1':ft + str(p)]
    auto_value = np.nanmax(df, axis=1) - np.nanmin(df, axis=1)
    return ft + '_ran' + str(p), auto_value
```

<span style="background-color:#f8f8f8"><span style="color:#aa5500"># Number of months within the last p months in which ft increased relative to the previous month</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Nci</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">arr</span>=<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)]) <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">arr</span>)): <span style="color:#000000">df_value</span> = <span style="color:#000000">arr</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">value_lst</span> = [] <span style="color:#770088">for</span> <span 
style="color:#000000">k</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)<span style="color:#981a1a">-</span><span style="color:#116644">1</span>): <span style="color:#000000">minus</span> = <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span>] <span style="color:#981a1a">-</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>] <span style="color:#000000">value_lst</span>.<span style="color:#000000">append</span>(<span style="color:#000000">minus</span>) <span style="color:#000000">value_ng</span> = <span style="color:#000000">np</span>.<span style="color:#000000">where</span>(<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">value_lst</span>)<span style="color:#981a1a">></span><span style="color:#116644">0</span>,<span style="color:#116644">1</span>,<span style="color:#116644">0</span>).<span style="color:#000000">sum</span>() <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">value_ng</span>)) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_nci'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#最近p个月中,特征ft的值,后一个月相比于前一个月减少了的月份数</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Ncd</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">arr</span>=<span style="color:#000000">np</span>.<span 
style="color:#000000">array</span>(<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)]) <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">arr</span>)): <span style="color:#000000">df_value</span> = <span style="color:#000000">arr</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">value_lst</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">k</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)<span style="color:#981a1a">-</span><span style="color:#116644">1</span>): <span style="color:#000000">minus</span> = <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span>] <span style="color:#981a1a">-</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>] <span style="color:#000000">value_lst</span>.<span style="color:#000000">append</span>(<span style="color:#000000">minus</span>) <span style="color:#000000">value_ng</span> = <span style="color:#000000">np</span>.<span style="color:#000000">where</span>(<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">value_lst</span>)<span style="color:#981a1a"><</span><span style="color:#116644">0</span>,<span style="color:#116644">1</span>,<span 
style="color:#116644">0</span>).<span style="color:#000000">sum</span>() <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">value_ng</span>)) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_ncd'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#最近p个月中,相邻月份ft 相等的月份数</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Ncn</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">arr</span>=<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)]) <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">arr</span>)): <span style="color:#000000">df_value</span> = <span style="color:#000000">arr</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">value_lst</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">k</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)<span 
style="color:#981a1a">-</span><span style="color:#116644">1</span>): <span style="color:#000000">minus</span> = <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span>] <span style="color:#981a1a">-</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>] <span style="color:#000000">value_lst</span>.<span style="color:#000000">append</span>(<span style="color:#000000">minus</span>) <span style="color:#000000">value_ng</span> = <span style="color:#000000">np</span>.<span style="color:#000000">where</span>(<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">value_lst</span>)==<span style="color:#116644">0</span>,<span style="color:#116644">1</span>,<span style="color:#116644">0</span>).<span style="color:#000000">sum</span>() <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">value_ng</span>)) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_ncn'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#最近P个月中,特征ft的值是否按月份严格递增,是返回1,否返回0</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Bup</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">arr</span>=<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span 
style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)]) <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">arr</span>)): <span style="color:#000000">df_value</span> = <span style="color:#000000">arr</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">value_lst</span> = [] <span style="color:#000000">index</span> = <span style="color:#116644">0</span> <span style="color:#770088">for</span> <span style="color:#000000">k</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)<span style="color:#981a1a">-</span><span style="color:#116644">1</span>): <span style="color:#770088">if</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span>] <span style="color:#981a1a">>=</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>]: <span style="color:#770088">break</span> <span style="color:#000000">index</span> <span style="color:#981a1a">+=</span> <span style="color:#116644">1</span> <span style="color:#aa5500">#index counts the adjacent pairs that passed; all p-1 pairs must pass</span> <span style="color:#770088">if</span> <span style="color:#000000">index</span> == <span style="color:#000000">p</span> <span style="color:#981a1a">-</span> <span style="color:#116644">1</span>: <span style="color:#000000">value</span>= <span style="color:#116644">1</span> <span style="color:#770088">else</span>: <span style="color:#000000">value</span> = <span style="color:#116644">0</span> <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">value</span>) <span style="color:#770088">return</span> <span 
style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_bup'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#最近P个月中,特征ft的值是否按月份严格递减,是返回1,否返回0</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Pdn</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">arr</span>=<span style="color:#000000">np</span>.<span style="color:#000000">array</span>(<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)]) <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">arr</span>)): <span style="color:#000000">df_value</span> = <span style="color:#000000">arr</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">value_lst</span> = [] <span style="color:#000000">index</span> = <span style="color:#116644">0</span> <span style="color:#770088">for</span> <span style="color:#000000">k</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)<span style="color:#981a1a">-</span><span style="color:#116644">1</span>): <span style="color:#770088">if</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span><span style="color:#981a1a">+</span><span 
style="color:#116644">1</span>] <span style="color:#981a1a">>=</span> <span style="color:#000000">df_value</span>[<span style="color:#000000">k</span>]: <span style="color:#770088">break</span> <span style="color:#000000">index</span> <span style="color:#981a1a">+=</span> <span style="color:#116644">1</span> <span style="color:#aa5500">#index counts the adjacent pairs that passed; all p-1 pairs must pass</span> <span style="color:#770088">if</span> <span style="color:#000000">index</span> == <span style="color:#000000">p</span> <span style="color:#981a1a">-</span> <span style="color:#116644">1</span>: <span style="color:#000000">value</span>= <span style="color:#116644">1</span> <span style="color:#770088">else</span>: <span style="color:#000000">value</span> = <span style="color:#116644">0</span> <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">value</span>) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_pdn'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#Trimmed mean of ft over the last p months, dropping the largest and smallest values</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Trm</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">auto_value</span> = [] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df</span>)): <span 
style="color:#000000">trm_mean</span> = <span style="color:#3300aa">list</span>(<span style="color:#000000">df</span>.<span style="color:#000000">loc</span>[<span style="color:#000000">i</span>,:]) <span style="color:#000000">trm_mean</span>.<span style="color:#000000">remove</span>(<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">trm_mean</span>)) <span style="color:#000000">trm_mean</span>.<span style="color:#000000">remove</span>(<span style="color:#000000">np</span>.<span style="color:#000000">nanmin</span>(<span style="color:#000000">trm_mean</span>)) <span style="color:#000000">temp</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">trm_mean</span>) <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">temp</span>) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_trm'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#当月ft / 最近p个月的ft中的最大值</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Cmx</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">auto_value</span> = (<span style="color:#000000">df</span>[<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span 
style="color:#aa1111">'1'</span>]) <span style="color:#981a1a">/</span> <span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_cmx'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#(current-month ft - mean ft over the last p months) / mean ft</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Cmp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">auto_value</span> = (<span style="color:#000000">df</span>[<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>] <span style="color:#981a1a">-</span> <span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> )) <span style="color:#981a1a">/</span><span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span 
style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_cmp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#( 当月ft - 最近p个月的ft最小值 ) /ft最小值 </span> <span style="color:#770088">def</span> <span style="color:#0000ff">Cnp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">auto_value</span> = (<span style="color:#000000">df</span>[<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>] <span style="color:#981a1a">-</span> <span style="color:#000000">np</span>.<span style="color:#000000">nanmin</span>(<span style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> )) <span style="color:#981a1a">/</span><span style="color:#000000">np</span>.<span style="color:#000000">nanmin</span>(<span style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_cnp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span 
style="color:#000000">auto_value</span> <span style="color:#aa5500">#Number of months between now and the month in which ft reached its maximum over the last p months</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Msx</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)].<span style="color:#000000">copy</span>() <span style="color:#aa5500">#copy so the assignments below do not write into data</span> <span style="color:#000000">df</span>[<span style="color:#aa1111">'_max'</span>] = <span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">df</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span>) <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#116644">1</span>,<span style="color:#000000">p</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>): <span style="color:#000000">df</span>[<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">i</span>)] = <span style="color:#3300aa">list</span>(<span style="color:#000000">df</span>[<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">i</span>)] == <span style="color:#000000">df</span>[<span style="color:#aa1111">'_max'</span>]) <span style="color:#770088">del</span> <span style="color:#000000">df</span>[<span style="color:#aa1111">'_max'</span>] <span style="color:#000000">df_value</span> = <span style="color:#000000">np</span>.<span style="color:#000000">where</span>(<span 
style="color:#000000">df</span>==<span style="color:#770088">True</span>,<span style="color:#116644">1</span>,<span style="color:#116644">0</span>) <span style="color:#000000">auto_value</span>=[] <span style="color:#770088">for</span> <span style="color:#000000">i</span> <span style="color:#770088">in</span> <span style="color:#3300aa">range</span>(<span style="color:#3300aa">len</span>(<span style="color:#000000">df_value</span>)): <span style="color:#000000">row_value</span>=<span style="color:#000000">df_value</span>[<span style="color:#000000">i</span>,:] <span style="color:#000000">indexs</span>=<span style="color:#116644">1</span> <span style="color:#770088">for</span> <span style="color:#000000">j</span> <span style="color:#770088">in</span> <span style="color:#000000">row_value</span>: <span style="color:#770088">if</span> <span style="color:#000000">j</span> == <span style="color:#116644">1</span>: <span style="color:#770088">break</span> <span style="color:#000000">indexs</span>+=<span style="color:#116644">1</span> <span style="color:#000000">auto_value</span>.<span style="color:#000000">append</span>(<span style="color:#000000">indexs</span>) <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_msx'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#最近p个月的均值/((p,2p)个月的ft均值)</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Rpp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df1</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span 
style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">value1</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">df1</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">df2</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>):<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#116644">2</span><span style="color:#981a1a">*</span><span style="color:#000000">p</span>)] <span style="color:#000000">value2</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">df2</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">auto_value</span> = <span style="color:#000000">value1</span><span style="color:#981a1a">/</span><span style="color:#000000">value2</span> <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_rpp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#(mean ft over the last p months) - (mean ft over months p+1 to 2p)</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Dpp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df1</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span 
style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">value1</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">df1</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">df2</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>):<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#116644">2</span><span style="color:#981a1a">*</span><span style="color:#000000">p</span>)] <span style="color:#000000">value2</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmean</span>(<span style="color:#000000">df2</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">auto_value</span> = <span style="color:#000000">value1</span> <span style="color:#981a1a">-</span> <span style="color:#000000">value2</span> <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_dpp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#(max ft over the last p months) / (max ft over months p+1 to 2p)</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Mpp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df1</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span 
style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">value1</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">df1</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">df2</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>):<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#116644">2</span><span style="color:#981a1a">*</span><span style="color:#000000">p</span>)] <span style="color:#000000">value2</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmax</span>(<span style="color:#000000">df2</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">auto_value</span> = <span style="color:#000000">value1</span><span style="color:#981a1a">/</span><span style="color:#000000">value2</span> <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_mpp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> <span style="color:#aa5500">#(min ft over the last p months) / (min ft over months p+1 to 2p)</span> <span style="color:#770088">def</span> <span style="color:#0000ff">Npp</span>(<span style="color:#000000">ft</span>,<span style="color:#000000">p</span>): <span style="color:#000000">df1</span>=<span 
style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'1'</span>:<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>)] <span style="color:#000000">value1</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmin</span>(<span style="color:#000000">df1</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">df2</span>=<span style="color:#000000">data</span>.<span style="color:#000000">loc</span>[:,<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span><span style="color:#981a1a">+</span><span style="color:#116644">1</span>):<span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#116644">2</span><span style="color:#981a1a">*</span><span style="color:#000000">p</span>)] <span style="color:#000000">value2</span>=<span style="color:#000000">np</span>.<span style="color:#000000">nanmin</span>(<span style="color:#000000">df2</span>,<span style="color:#000000">axis</span> = <span style="color:#116644">1</span> ) <span style="color:#000000">auto_value</span> = <span style="color:#000000">value1</span><span style="color:#981a1a">/</span><span style="color:#000000">value2</span> <span style="color:#770088">return</span> <span style="color:#000000">ft</span><span style="color:#981a1a">+</span><span style="color:#aa1111">'_npp'</span><span style="color:#981a1a">+</span><span style="color:#3300aa">str</span>(<span style="color:#000000">p</span>),<span style="color:#000000">auto_value</span> </span>
- Wrap the derivation methods above in a single function
<span style="background-color:#f8f8f8"># Batch-call every two-argument derivation function for one feature and one window p
def auto_var2(feature, p):
    # Names of the derivation functions defined above, in call order.
    # Each function returns (column_name, values).
    names = ['Num', 'Nmz', 'Evr', 'Avg', 'Tot', 'Tot2T', 'Max', 'Min', 'Msg',
             'Msz', 'Cav', 'Cmn', 'Std', 'Cva', 'Cmm', 'Cnm', 'Cxm', 'Cxp',
             'Ran', 'Nci', 'Ncd', 'Ncn', 'Pdn', 'Cmx', 'Cmp', 'Cnp', 'Msx',
             'Trm', 'Bup', 'Mai', 'Mad', 'Rpp', 'Dpp', 'Mpp', 'Npp']
    for name in names:
        try:
            columns_name, values = globals()[name](feature, p)
            data_new[columns_name] = values
        except Exception:
            # Report the failing function and continue with the rest
            print(name + " PARSE ERROR", feature, p)
    return data_new.columns.size</span>
- Apply the wrapped function to the data loaded earlier
<span style="background-color:#f8f8f8">data_new = pd.DataFrame()
for p in range(1, 12):
    for inv in ['ft', 'gt']:
        auto_var2(inv, p)
</span>
<span style="background-color:#f8f8f8">data_new.columns.tolist() </span>
Output
<span style="background-color:#f8f8f8">array(['ft_num1', 'ft_evr1', 'ft_avg1', 'ft_tot1', 'ft_tot2t1', 'ft_max1', 'ft_min1', 'ft_msg1', 'ft_msz1', 'ft_cav1', 'ft_cmn1', 'ft_std1', 'ft_cva1', 'ft_cmm1', 'ft_cnm1', 'ft_cxm1', 'ft_cxp1', 'ft_ran1', 'ft_nci1', 'ft_ncd1', 'ft_ncn1', 'ft_pdn1', 'ft_cmx1', 'ft_cmp1', 'ft_cnp1', 'ft_msx1', 'ft_bup1', 'ft_rpp1', 'ft_dpp1', 'ft_mpp1', 'ft_npp1', 'gt_num1', 'gt_evr1', 'gt_avg1', 'gt_tot1', 'gt_tot2t1', 'gt_max1', 'gt_min1', 'gt_msg1', 'gt_msz1', 'gt_cav1', 'gt_cmn1', 'gt_std1', 'gt_cva1', 'gt_cmm1', 'gt_cnm1', 'gt_cxm1', 'gt_cxp1', ........]) </span>
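- Features produced by this kind of indiscriminate aggregation are usually highly collinear and add little new information, which hurts the model's robustness and stability.
- A scorecard model demands stability far more than raw performance.
- With a one-year time window, p is usually chosen from prior knowledge (for example 3, 6, or 12) rather than by iterating over every value from 1 to 12.
- During later feature selection, variables are filtered further by significance, collinearity, and similar criteria.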
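The collinearity concern above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data is synthetic, the columns `ft1`..`ft12` mimic the monthly layout used earlier, and the greedy `drop_collinear` helper with a 0.9 threshold is a hypothetical filter, not the course's selection procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic monthly features ft1..ft12 for 200 users (illustrative data only)
base = rng.normal(size=(200, 1))
monthly = pd.DataFrame(
    base + 0.3 * rng.normal(size=(200, 12)),
    columns=[f"ft{i}" for i in range(1, 13)],
)

# Aggregate only over windows chosen from prior knowledge
windows = [3, 6, 12]
agg = pd.DataFrame({
    f"ft_avg{p}": monthly.loc[:, "ft1":f"ft{p}"].mean(axis=1) for p in windows
})

# Pairwise absolute correlation between the aggregated features
corr = agg.corr().abs()


def drop_collinear(corr, threshold=0.9):
    """Keep a column only if its correlation with every kept column is below threshold."""
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept


kept = drop_collinear(corr, threshold=0.9)
print(corr.round(3))
print(kept)
```

Averages over overlapping windows share most of their inputs, so the correlation matrix is expected to show large off-diagonal values even with only three windows.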
- Compare the most recent value (current) with the historical values (history)
  - current/history
  - current-history
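A minimal sketch of the two comparison features, assuming a hypothetical table with one column per month (`m1` is the most recent month) and the mean of the earlier months as the history value. The zero denominator is turned into a missing value here so that the ratio does not become infinite.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly order amounts; m1 is the most recent month
df = pd.DataFrame({
    "m1": [120.0, 0.0, 50.0],
    "m2": [100.0, 0.0, 50.0],
    "m3": [80.0, 0.0, 150.0],
})

current = df["m1"]
history_avg = df[["m2", "m3"]].mean(axis=1)

# current - history
df["cur_minus_hist"] = current - history_avg
# current / history; a zero history becomes NaN instead of producing inf
df["cur_over_hist"] = current / history_avg.replace(0, np.nan)

print(df)
```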
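- Handling missing values in user time-series features
  - Prefer filling with zero: most of these features are counts, so a missing value is filled with 0
    - A user with no purchase history: max_gmv and min_gmv can both be filled with 0
  - Problems caused by filling with 0
    - Ratios such as cur/history_avg become 0/0 or 1/0
  - Fill missing values according to the risk trend (default probability: no purchase history > one historical order > two historical orders)
    - No purchase history: cur/history_avg is 0/0, so fill with -2
    - One historical order: cur/history_avg is 1/0, so fill with -1
    - Two historical orders: cur/history_avg is 1/1, which gives a computable value greater than 0
  - For "days since the user's last overdue", how should the missing value be filled for a user with no credit history?
  - When a large share of values is missing, consider turning missingness into a separate feature
    - Example: the GPS sequence feature gps_count_last_3month
      - A missing value means the user did not grant the GPS permission
  - When missingness has a clear business meaning, fill with the business default
    - Credit limit (use the initial limit)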
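- Summary of missing-value handling

| Missing-value situation | Handling |
| --- | --- |
| Ordinary count features | Prefer filling with 0 |
| A risk trend exists | Fill according to the risk trend |
| Too many missing values | Consider adding an is-missing indicator column |
| Missingness has business meaning | Fill with the business default |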
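The four rules in the summary can be sketched in pandas as follows. The sample rows, the column names other than `gps_count_last_3month`, and the `INITIAL_LIMIT` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only gps_count_last_3month is a name taken from the notes
df = pd.DataFrame({
    "cur_orders": [0, 1, 1],
    "hist_orders": [0, 0, 1],
    "max_gmv": [np.nan, 35.0, 80.0],
    "gps_count_last_3month": [np.nan, np.nan, 12.0],
    "credit_limit": [np.nan, 5000.0, 8000.0],
})
INITIAL_LIMIT = 3000.0  # assumed business default

# 1) Count or amount features: fill with 0
df["max_gmv"] = df["max_gmv"].fillna(0)

# 2) Ratio with a risk trend: -2 for no history at all, -1 for a first order
ratio = df["cur_orders"] / df["hist_orders"].replace(0, np.nan)
no_history = (df["cur_orders"] == 0) & (df["hist_orders"] == 0)
first_order = (df["cur_orders"] > 0) & (df["hist_orders"] == 0)
df["cur_over_hist"] = ratio.mask(first_order, -1.0).mask(no_history, -2.0)

# 3) Many missing values: add an is-missing indicator column
df["gps_missing"] = df["gps_count_last_3month"].isna().astype(int)

# 4) Missingness with business meaning: fill with the business default
df["credit_limit"] = df["credit_limit"].fillna(INITIAL_LIMIT)

print(df)
```

The negative fill values keep the ordering of risk (no history, then one order, then a real ratio) on a single numeric axis, which is what a later binning step needs.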
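Future information in time-series data
- Taking the time of loan 2 as the observation point, the future information in the table below leads the model to treat users with many returns as bad customers, but the model then performs worse after it goes live
- Remedies during feature construction
  - For orders outside the future-information window, compute net-order features: net order, NMV
    - NMV: Net Merchandise Value
  - For orders inside the future-information window, compute only gross features: order, GMV
    - GMV: Gross Merchandise Volume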
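One way to read the remedy above is sketched below. The 7-day `RETURN_WINDOW`, the observation date, and the order table are assumptions for illustration; the idea is that an order's return status is only trusted once its return window has closed before the observation point.

```python
import pandas as pd

obs_time = pd.Timestamp("2020-03-05")   # hypothetical observation point
RETURN_WINDOW = pd.Timedelta(days=7)    # assumed length of the return window

orders = pd.DataFrame({
    "order_time": pd.to_datetime(
        ["2020-02-01", "2020-02-20", "2020-03-02", "2020-03-04"]
    ),
    "amount": [100.0, 200.0, 50.0, 80.0],
    # Final status as stored today, which may postdate the observation point
    "returned": [False, True, False, False],
})
orders = orders[orders["order_time"] <= obs_time]

# Orders whose return window closed before the observation point
settled = orders["order_time"] + RETURN_WINDOW <= obs_time

features = {
    # Outside the future-information window: net features are safe
    "settled_net_order_cnt": int((settled & ~orders["returned"]).sum()),
    "settled_nmv": float(orders.loc[settled & ~orders["returned"], "amount"].sum()),
    # Inside the window: only gross features, ignoring the return flag
    "recent_order_cnt": int((~settled).sum()),
    "recent_gmv": float(orders.loc[~settled, "amount"].sum()),
}
print(features)
```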
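- Historical credit features are also very prone to future information
  - Example: for a credit card, the statement date is the 1st of each month, the payment due date is the 10th, and the account reaches M1 around the 10th of the following month
    - At the cross-section time shown in the figure above (for example March 5), the DPD30 status of the February bill cannot yet be observed
    - If the database has no snapshot table, however, a later query can still retrieve the DPD30 status of the February bill
    - The solution is the same as in the previous example: handle the intervals separately. Bills can be split into three classes
      - The current bill that has not been issued yet
      - The most recently issued bill
      - All other issued bills (only this class can be used to build overdue features)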
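- Summary of handling future information
  - Add snapshot tables promptly
  - Without a snapshot table, split the data into intervals with and without future information and construct features for each interval separately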
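Once a snapshot or log table records when each status took effect, a point-in-time join avoids the leak. The sketch below reuses the crawler-authorization example with assumed dates and column names, and uses `pandas.merge_asof` to pick, for each loan, the latest authorization record at or before the loan time. A plain join on the user is shown only for contrast.

```python
import pandas as pd

loans = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "loan": ["l11", "l12", "l13"],
    "loan_time": pd.to_datetime(["2020-01-05", "2020-02-05", "2020-03-05"]),
})

# Snapshot of authorization status with the time it took effect (assumed dates)
auth_log = pd.DataFrame({
    "user": ["u1"],
    "auth_time": pd.to_datetime(["2020-02-20"]),
    "authorized": ["Y"],
})

# Naive join: every loan inherits today's status, which is future information
naive = loans.merge(auth_log[["user", "authorized"]], on="user", how="left")

# Point-in-time join: merge_asof needs both frames sorted by their time keys
pit = pd.merge_asof(
    loans.sort_values("loan_time"),
    auth_log.sort_values("auth_time"),
    left_on="loan_time",
    right_on="auth_time",
    by="user",
    direction="backward",
)
pit["authorized"] = pit["authorized"].fillna("N")

print(naive[["loan", "authorized"]])
print(pit[["loan", "authorized"]])
```

The point-in-time result marks only the third loan as authorized, matching what was actually known when each loan was made.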
-
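2.3 Feature transformation
Binning (discretization)
- Concept
  - Binning is an essential step in feature construction
  - Binning discretizes a continuous variable and merges its values into a small number of states
- Why bin
  - Discrete features are easy to add and remove, which supports fast model iteration
  - Inner products of sparse vectors are fast to compute, and the results are easy to store and extend
  - Binned (discretized) features are highly robust to outliers
  - After a single variable is binned into N bins, each bin has its own weight, which introduces non-linearity into the model and improves its expressive power
  - Binned features can be crossed, turning M+N variables into M*N variables, which introduces further non-linearity and improves expressive power
  - The model is more stable after binning. For example, if age is discretized with 20-30 as one interval, a user does not become a new feature value merely because age increases by one
  - After discretization, missing values can enter the model as a separate category
- What is a good way to discretize (bin)?
  - Equal frequency? Equal width? Something else?
  - How many bins? 10? 100? ...
  - What to do with missing values when binning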
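- Common binning methods: chi-square binning, decision-tree binning, equal-frequency binning, clustering binning
- Equal-frequency binning:
  - Cuts the data evenly according to its distribution, so each bin holds roughly the same number of samples
  - Generalizes poorly when there are few samples
  - May fail to produce bins when the samples are imbalanced
  - Commonly used for feature analysis
- Equal-width binning:
  - Cuts the range of the feature values into equal intervals, so each bin spans the same distance
  - Can always produce bins
  - Cannot guarantee an even number of samples per bin
  - Commonly used for credit score statistics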
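The difference between the two methods can be seen on a small skewed sample. This is a minimal sketch using `pandas.qcut` and `pandas.cut`; the age values are made up for illustration.

```python
import pandas as pd

# Hypothetical ages, skewed toward the low end
ages = pd.Series([22, 23, 24, 25, 26, 27, 28, 30, 45, 65])

# Equal-frequency: bin edges are quantiles, so bins hold similar sample counts
equal_freq = pd.qcut(ages, q=5)

# Equal-width: bin edges are evenly spaced over the value range
equal_width = pd.cut(ages, bins=5)

print(equal_freq.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

On data like this, equal-width binning puts most samples into the first bin and leaves some bins empty, while equal-frequency binning keeps the counts balanced at the cost of uneven interval widths.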
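- Chi-square binning: use the chi-square test to determine the optimal bin thresholds
  - Start from equal-frequency or equal-width bins, compute the chi-square values, and merge the two adjacent bins with the smaller value. This widens the difference in good/bad ratio between bins and makes a high IV easier to obtain
  - Chi-square binning uses a test of independence to choose the split points. The procedure has two steps, initialization and merging
    - Initialization: sort the samples by the value of the continuous variable and build the initial discretization
    - Merging: compute the chi-square value for each candidate merge of adjacent bins, merge the two bins with the smallest value, and repeat until the required number of bins is reached

| | [22-35] | (35-45] | (45-55] | (55-65] | Total |
| --- | --- | --- | --- | --- | --- |
| good | 3 | 2 | 2 | 1 | 8 |
| bad | 1 | 2 | 2 | 3 | 8 |
| p | | | | | 50% |
| p(good+bad) | 2 | 2 | 2 | 2 | - |
| chi2 | (1-2)^2/2=1/2 | (2-2)^2/2=0 | (2-2)^2/2=0 | (3-2)^2/2=1/2 | - |

Merge the bins whose bad rate is close to the average level, and keep the bins whose rate differs markedly.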
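The merging step can be sketched in plain Python on the counts from the table. Note the assumption: this sketch scores each adjacent pair with the standard Pearson chi-square statistic of its 2x2 good/bad table, which is a common formulation of chi-merge and differs from the simplified per-bin calculation shown in the table. The function names `chi2_pair` and `chi_merge` are hypothetical.

```python
def chi2_pair(a, b):
    """Pearson chi-square of the 2x2 table built from two adjacent bins.

    Each bin is a (good, bad) tuple of counts.
    """
    total = sum(a) + sum(b)
    col_totals = (a[0] + b[0], a[1] + b[1])
    stat = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * col_totals[j] / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def chi_merge(labels, counts, n_bins):
    """Repeatedly merge the adjacent pair with the smallest chi-square."""
    labels, counts = list(labels), list(counts)
    while len(counts) > n_bins:
        stats = [chi2_pair(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = stats.index(min(stats))
        merged = (counts[i][0] + counts[i + 1][0], counts[i][1] + counts[i + 1][1])
        counts[i:i + 2] = [merged]
        labels[i:i + 2] = [labels[i] + "+" + labels[i + 1]]
    return labels, counts


labels = ["[22-35]", "(35-45]", "(45-55]", "(55-65]"]
counts = [(3, 1), (2, 2), (2, 2), (1, 3)]  # (good, bad) per bin, from the table

merged_labels, merged_counts = chi_merge(labels, counts, n_bins=3)
print(merged_labels)
print(merged_counts)
```

The two middle bins have identical good/bad ratios, so their pairwise statistic is zero and they are merged first, which agrees with the rule of merging bins whose bad rate sits near the average.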
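- Case study: binning with the toad library
The dataset is germancredit
- Toad is a Python toolkit designed for industrial model development, aimed in particular at scorecard development
- Toad covers the full modeling workflow, from EDA, feature engineering, and feature selection to model validation and scorecard conversion
- Toad's main functions greatly simplify the most important and most time-consuming steps in modeling: feature selection and binning.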
<span style="background-color:#f8f8f8"><span style="color:#333333">import pandas as pd
import numpy as np
import toad
data = pd.read_csv('data/germancredit.csv')
data.replace({'good':0,'bad':1},inplace=True)
print(data.shape) # 1000 rows, 21 columns (20 features plus the label)
data.head()
</span></span>
Output
<span style="background-color:#f8f8f8">(1000, 21) </span>
| | status.of.existing.checking.account | duration.in.month | credit.history | purpose | credit.amount | savings.account.and.bonds | present.employment.since | installment.rate.in.percentage.of.disposable.income | personal.status.and.sex | other.debtors.or.guarantors | ... | property | age.in.years | other.installment.plans | housing | number.of.existing.credits.at.this.bank | job | number.of.people.being.liable.to.provide.maintenance.for | telephone | foreign.worker | creditability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ... < 0 DM | 6 | critical account/ other credits existing (not ... | radio/television | 1169 | unknown/ no savings account | ... >= 7 years | 4 | male : single | none | ... | real estate | 67 | none | own | 2 | skilled employee / official | 1 | yes, registered under the customers name | yes | 0 |
| 1 | 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | radio/television | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | female : divorced/separated/married | none | ... | real estate | 22 | none | own | 1 | skilled employee / official | 1 | none | yes | 1 |
| 2 | no checking account | 12 | critical account/ other credits existing (not ... | education | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | none | ... | real estate | 49 | none | own | 1 | unskilled - resident | 2 | none | yes | 0 |
| 3 | ... < 0 DM | 42 | existing credits paid back duly till now | furniture/equipment | 7882 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | guarantor | ... | building society savings agreement/ life insur... | 45 | none | for free | 1 | skilled employee / official | 2 | none | yes | 0 |
| 4 | ... < 0 DM | 24 | delay in paying off in the past | car (new) | 4870 | ... < 100 DM | 1 <= ... < 4 years | 3 | male : single | none | ... | unknown / no property | 53 | none | for free | 2 | skilled employee / official | 2 | none | yes | 1 |

5 rows × 21 columns
- Field descriptions
  - Status of existing checking account: balance status of the existing checking account
  - Duration in month: duration in months
  - Credit history: credit history record
  - Purpose: purpose of the application
  - Credit amount
  - Savings account/bonds: amount in savings accounts or bonds
  - Present employment since: years in the current employment
  - Installment rate in percentage of disposable income
  - Personal status and gender: marital status and gender
  - Other debtors / guarantors
  - Present residence since: years at the current residence
  - Property
  - Age in years
  - Other installment plans
  - Housing: housing situation
  - Number of existing credits at this bank
  - Job: type of job
  - Number of people being liable to provide maintenance for: number of dependents
  - Telephone: whether a telephone number is registered
  - foreign worker: whether the applicant is a foreign worker
  - creditability: the label
-
- The Combiner class in toad performs the binning
<span style="background-color:#f8f8f8"><span style="color:#333333"># Initialize a Combiner
combiner = toad.transform.Combiner()
# Fit on the data and choose the binning method; the other parameters are optional
# min_samples: minimum number of samples per bin, given as a count or a proportion
combiner.fit(data,y='creditability',method='chi',min_samples = 0.05)
# Export the binning result as a dict
bins = combiner.export()
# Inspect the binning result
print('duration.in.month:', bins['duration.in.month'])
</span></span>
Output
<span style="background-color:#f8f8f8">duration.in.month: [9, 12, 13, 16, 36, 45] </span>
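- A bivariate plot (Bivar plot) is commonly used to evaluate a binning result. Note that in credit risk analysis the vertical axis of a Bivar plot is fixed as the share of negative (bad) samples
- Use bin_plot() to plot the bins and adjust the binning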
<span style="background-color:#f8f8f8"><span style="color:#333333">import matplotlib.pyplot as plt
%matplotlib inline
from toad.plot import bin_plot
c2 = toad.transform.Combiner()
c2.fit(data[['duration.in.month','creditability']],
y='creditability', method='chi',n_bins=7)
transformed = c2.transform(data[['duration.in.month','creditability']],labels=True)
# The data passed to bin_plot must already be transformed into bins
bin_plot(transformed,x='duration.in.month',target='creditability')
</span></span>
Output
<span style="background-color:#f8f8f8"><AxesSubplot:xlabel='duration.in.month', ylabel='prop'> </span>
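- In the plot above, the bars show each bin's share of the samples and the line shows each bin's bad rate. The line should generally follow a monotonic trend
- A monotonic trend can be achieved by adjusting the number of bins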
<span style="background-color:#f8f8f8"><span style="color:#333333">c2 = toad.transform.Combiner()
c2.fit(data[['duration.in.month','creditability']],
y='creditability', method='chi',n_bins=5) # changed to 5 bins
transformed = c2.transform(data[['duration.in.month','creditability']],labels=True)
#传给bin_plot的数据必须是分箱转化之后的
bin_plot(transformed,x='duration.in.month',target='creditability')
</span></span>
Output:
<span style="background-color:#f8f8f8"><AxesSubplot:xlabel='duration.in.month', ylabel='prop'> </span>
-
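Other binning methods: clustering-based binning (k-means), decision tree binning, equal-frequency binning, and equal-width binning
-
Comparison of binning methods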
<span style="background-color:#f8f8f8"><span style="color:#333333">for method in ['chi', 'dt', 'quantile', 'step', 'kmeans']:
    c2 = toad.transform.Combiner()
    c2.fit(data[['duration.in.month','creditability']],
           y='creditability', method=method, n_bins=5)
    # Plot the bad rate per bin for this binning method
    bin_plot(c2.transform(data[['duration.in.month','creditability']],labels=True),
             x='duration.in.month',target='creditability')
</span></span>
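The difference between equal-width and equal-frequency binning can be seen without toad. The sketch below is a minimal illustration on made-up durations, using plain NumPy rather than toad's own implementation: equal-width splits the value range into equal intervals, while equal-frequency places the split points at quantiles.

```python
import numpy as np

# Hypothetical loan durations in months (illustrative values only)
values = np.array([4, 6, 6, 12, 12, 18, 24, 24, 36, 48, 60, 72])
n_bins = 4

# Equal-width: interior split points evenly spaced between min and max
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]

# Equal-frequency: interior split points at the 25%, 50% and 75% quantiles
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

# Number of samples that land in each bin
width_counts = np.bincount(np.digitize(values, width_edges), minlength=n_bins)
freq_counts = np.bincount(np.digitize(values, freq_edges), minlength=n_bins)

print("equal-width edges:", width_edges.tolist(), "counts:", width_counts.tolist())
print("equal-frequency edges:", freq_edges.tolist(), "counts:", freq_counts.tolist())
```

On skewed data the equal-width bins end up with very uneven sample counts, whereas the equal-frequency bins hold the same number of samples each.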
-
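Chi-square binning is generally preferred for its monotonicity and model stability
-
Multi-valued unordered categorical features need to be encoded. Common encoding methods: Onehot Encoding, Label Encoding, WOE Encoding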
-
Onehot Encoding
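As a minimal sketch of one-hot encoding, the example below uses `pandas.get_dummies` on a made-up marital-status column. The column name and values are assumptions for illustration, not part of the dataset above.

```python
import pandas as pd

# Hypothetical marital-status values, for illustration only
df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "widowed", "married"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
onehot = pd.get_dummies(df["marital_status"], prefix="marital", dtype=int)
print(onehot)
```

Each row has exactly one 1, so a feature with many categories produces a wide and sparse matrix.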
-
Label Encoding
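Compute the overdue rate (dpd rate) for each marital status and use it as the numeric label
Marital status | Encoded value |
---|---|
Single | dpd rate |
Married | dpd rate |
Divorced | dpd rate |
Widowed | dpd rate |
Drawback: with little data, the rate for some categories may be biased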
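The rate-based encoding described above can be sketched with a pandas `groupby`. The data below is made up, and the column names `marital_status` and `overdue` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical historical loans: marital status and overdue flag (1 = overdue)
history = pd.DataFrame({
    "marital_status": ["single", "single", "married", "married", "married",
                       "divorced", "divorced", "widowed"],
    "overdue": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Overdue rate per category, estimated on historical (training) data only
rate_by_status = history.groupby("marital_status")["overdue"].mean()

# Replace each category with its overdue rate
history["marital_status_encoded"] = history["marital_status"].map(rate_by_status)

# Sample size per category; very small groups give unreliable rates
print(history.groupby("marital_status")["overdue"].agg(["mean", "count"]))
```

The `count` column makes the drawback visible: a category with one or two samples gets an extreme rate such as 0.0 or 1.0.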
-
WOE Encoding
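WOE (Weight of Evidence) measures how well a single feature separates good users from bad users
$$WOE_k=\ln\left(\frac{p^k_{good}}{p^k_{bad}}\right)$$ where $p^k_{good}$ is the share of all good users that fall in group $k$ and $p^k_{bad}$ is the share of all bad users that fall in group $k$
Marital status | Good | Bad | G-B | ln(G/B) | WOE |
---|---|---|---|---|---|
Single | 30% | 20% | 10% | 0.405 | 0.405 |
Married | 40% | 10% | 30% | 1.386 | 1.386 |
Divorced | 10% | 40% | -30% | -1.386 | -1.386 |
Widowed | 20% | 30% | -10% | -0.405 | -0.405 |
Total | 100% | 100% | | | |
-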
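The WOE column of the marital-status table can be reproduced in a few lines. This is a small check with NumPy using only the Good and Bad percentages from the table; the English group names are labels chosen for the example.

```python
import numpy as np

groups = ["single", "married", "divorced", "widowed"]

# Share of all good users and of all bad users in each group (from the table)
p_good = np.array([0.30, 0.40, 0.10, 0.20])
p_bad = np.array([0.20, 0.10, 0.40, 0.30])

# WOE_k = ln(p_good_k / p_bad_k)
woe = np.log(p_good / p_bad)

for name, value in zip(groups, woe):
    print(f"{name}: {value:.3f}")
```

Groups where good users are over-represented get a positive WOE, and groups where bad users are over-represented get a negative one.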
Computing WOE with toad
<span style="background-color:#f8f8f8">from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(data.drop('creditability',axis=1),data['creditability'],test_size=0.25,random_state=450)
data_train = pd.concat([X_train,Y_train],axis=1)
# Add a column that marks train/test membership
data_train['type'] = 'train'
data_test = pd.concat([X_test,Y_test],axis=1)
data_test['type'] = 'test'
# Set the bin boundaries
adj_bin = {'duration.in.month': [9, 12, 18, 33]}
c2 = toad.transform.Combiner()
c2.set_rules(adj_bin)
data_ = pd.concat([data_train,data_test],axis = 0)
# Apply the binning
temp_data = c2.transform(data_[['duration.in.month','creditability','type']])
# Draw the bad rate plot
from toad.plot import badrate_plot, proportion_plot
badrate_plot(temp_data, target = 'creditability', x = 'type', by = 'duration.in.month')
# Plot the share of samples in each bin
proportion_plot(temp_data['duration.in.month'])
</span>
Output:
<span style="background-color:#f8f8f8"><AxesSubplot:xlabel='value', ylabel='proportion'> </span>
-
In the first plot above, the bad_rate of bin 1 and bin 2 is inverted, which means bad_rate is not monotonic and the bins need adjusting. Bin 1 and bin 2 can be merged
<span style="background-color:#f8f8f8"># Suppose bin 1 and bin 2 are merged by dropping the split point 12 (bin indices start at 0)
adj_bin = {'duration.in.month': [9,18,33]}
c2.set_rules(adj_bin)
temp_data = c2.transform(data_[['duration.in.month','creditability','type']])
badrate_plot(temp_data, target = 'creditability', x = 'type', by = 'duration.in.month')
</span>
Output:
<span style="background-color:#f8f8f8"># Convert feature values to bin indices
binned_data = c2.transform(data_train)
# Compute WOE
transer = toad.transform.WOETransformer()
# Map the WOE values onto the original dataset. Use fit_transform on the training set and transform on the test set.
data_tr_woe = transer.fit_transform(binned_data, binned_data['creditability'], exclude=['creditability','type'])
data_tr_woe.head()
</span>
Output:
index | status.of.existing.checking.account | duration.in.month | credit.history | purpose | credit.amount | savings.account.and.bonds | present.employment.since | installment.rate.in.percentage.of.disposable.income | personal.status.and.sex | other.debtors.or.guarantors | ... | age.in.years | other.installment.plans | housing | number.of.existing.credits.at.this.bank | job | number.of.people.being.liable.to.provide.maintenance.for | telephone | foreign.worker | creditability | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
569 | 0.786313 | 0.786622 | 0.069322 | -0.384125 | 0.333152 | 0.244802 | 0.002898 | -0.056341 | 0.355058 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.002648 | 0.012822 | 0.001722 | 0.043742 | 1 | train |
574 | 0.363027 | -0.279729 | 0.069322 | -0.384125 | -0.159408 | 0.244802 | -0.173326 | 0.154169 | -0.212356 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | -0.071350 | -0.298467 | 0.012822 | -0.001130 | 0.043742 | 0 | train |
993 | 0.786313 | 0.786622 | 0.069322 | 0.141484 | 0.333152 | 0.244802 | 0.534527 | 0.154169 | -0.212356 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.311383 | 0.012822 | 0.001722 | 0.043742 | 0 | train |
355 | 0.363027 | 0.099812 | 0.069322 | 0.272947 | -0.159408 | 0.244802 | 0.399313 | 0.154169 | -0.212356 | 0.0 | ... | 0.546949 | 0.605057 | -0.174441 | 0.039485 | -0.298467 | 0.012822 | -0.001130 | 0.043742 | 1 | train |
508 | -1.072960 | 0.099812 | 0.069322 | -0.384125 | -0.159408 | 0.244802 | 0.002898 | 0.154169 | -0.302447 | 0.0 | ... | 0.085604 | -0.157497 | -0.174441 | 0.039485 | 0.002648 | 0.012822 | -0.001130 | 0.043742 | 0 | train |
-
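How to read WOE: it compares the good-to-bad ratio within the current group with the same ratio over all samples, and expresses the difference as the logarithm of the quotient of the two ratios
-
The larger the WOE, the higher the good-to-bad ratio in this group relative to the whole sample, and the more likely a user in this group is a good user
-
The smaller (more negative) the WOE, the lower the good-to-bad ratio in this group relative to the whole sample, and the less likely a user in this group is a good user. A WOE close to 0 means the group looks like the overall population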
-
-
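The binning directly affects the WOE values: different bins can give very different WOE mappings
-
Keep the total number of bins between 5 and 10 (adjust as appropriate, but usually no more than 10)
-
When merging bins, the basic principle is to make the difference in bad-sample rate between adjacent bins as large as possible
-
Each bin should contain at least 5% of all samples, so that the frequency in every bin is statistically meaningful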
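The binning rules above can be checked mechanically. The helper below is a hypothetical sketch (`bin_report` is not a toad function) that reports each bin's sample share and bad rate, flags bins under the 5% threshold, and tests whether the bad rate is monotonic. The bin indices and labels are made up.

```python
import pandas as pd

def bin_report(bins, target, min_share=0.05):
    """Per-bin sample share and bad rate, with a flag for bins below min_share."""
    frame = pd.DataFrame({"bin": bins, "target": target})
    report = frame.groupby("bin")["target"].agg(count="count", bad_rate="mean")
    report["share"] = report["count"] / len(frame)
    report["too_small"] = report["share"] < min_share
    return report

# Hypothetical binned feature (bin indices) and labels (1 = bad), illustration only
bins = [0] * 1 + [1] * 9 + [2] * 10 + [3] * 10
target = [1] + [1, 1, 1, 1, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1] + [0] * 9

report = bin_report(bins, target)
print(report)
# A monotonic bad rate across bins is what the Bivar plot is inspected for
print("monotonic:", report["bad_rate"].is_monotonic_decreasing)
```

Here bin 0 holds only 1 of 30 samples, so it is flagged and would be a candidate for merging with its neighbor.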
-
-
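Pros and cons of the three encodings
-
Encoding | Advantages | Disadvantages |
---|---|---|
Onehot Encoding | Simple to apply, stable, needs no normalization, does not depend on historical data | Data becomes very sparse |
Label Encoding | Separates well, low dimensionality | Requires statistics from historical data, unstable, needs normalization |
WOE Encoding | Separates well, low dimensionality, needs no normalization | Requires statistics from historical data, unstable |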
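Encoding multi-valued ordered categorical features
-
Education level: bachelor's, master's, doctorate
-
To some extent the education level maps directly to a user's credit risk, so it can be treated as an ordered feature
-
Multi-valued ordered features can be converted to the numeric values 1, 2, 3...
-
bachelor's → 1, master's → 2, doctorate → 3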
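The ordinal mapping above is a one-line `map` in pandas. The column name and category spellings below are assumptions for illustration.

```python
import pandas as pd

# Explicit order: a higher number means a higher education level
education_order = {"bachelor": 1, "master": 2, "doctorate": 3}

# Hypothetical users, for illustration only
df = pd.DataFrame({"education": ["master", "bachelor", "doctorate", "bachelor"]})

# Categories missing from the mapping would become NaN and need separate handling
df["education_level"] = df["education"].map(education_order)
print(df)
```

Writing the mapping by hand keeps the order under the modeler's control instead of relying on alphabetical or first-seen order.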
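Feature combination
Also called feature crossing: features are combined segment by segment, based on common sense, experience, or data mining techniques, to construct new features that carry more information.
Feature dimension | Male programmer | Female programmer |
---|---|---|
Young | Young male programmer | Young female programmer |
Middle-aged | Middle-aged male programmer | Middle-aged female programmer |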
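A crossed feature like the one in the table can be built by segmenting one feature and concatenating it with another. In the sketch below the data is made up and the age threshold of 35 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical users, for illustration only
df = pd.DataFrame({
    "age": [24, 45, 31, 52],
    "gender": ["male", "female", "female", "male"],
})

# Segment age first; the cut-off of 35 is an arbitrary choice for this example
df["age_group"] = df["age"].apply(lambda a: "young" if a < 35 else "middle_aged")

# Cross the two categorical features into one combined feature
df["age_gender"] = df["age_group"] + "_" + df["gender"]
print(df)
```

The crossed column has one category per cell of the table and can then be encoded like any other categorical feature.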
-
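A decision tree model can greedily search for the best feature combinations according to a chosen metric. Take the case at the end of the previous section as an example
-
From the rules above, the following features can be derived
<span style="background-color:#f8f8f8"><span style="color:#333333"># axis=1 applies the function to each row
x['n1'] = x.apply(lambda row: 1 if row.amount_tot>48077.5
                  and row.amount_cnt<=3.5 else 0, axis=1)
x['n2'] = x.apply(lambda row: 1 if row.amount_tot>48077.5
                  and row.amount_cnt>3.5 else 0, axis=1)
</span></span>
-
Using decision trees to combine features automatically makes the modeler's work considerably easier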
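2.4 User association features
-
How do we evaluate a new customer who has no internal data?
-
Use external third-party data
-
Link the new user to internal users and evaluate with the information of the matched existing customers
-
-
For linking user features, an inverted table can be used
-
User → [feature 1, feature 2, feature 3...]
-
Feature → [user 1, user 2, user 3...]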
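The two directions of the table can be sketched with plain dictionaries. The user IDs and feature values below (device and phone-prefix strings) are hypothetical, as is the helper `related_users`.

```python
from collections import defaultdict

# Forward table: user -> features (hypothetical device and phone-prefix values)
user_features = {
    "user1": ["device_A", "phone_138"],
    "user2": ["device_A", "phone_139"],
    "user3": ["device_B", "phone_138"],
}

# Inverted table: feature -> users
feature_users = defaultdict(list)
for user, features in user_features.items():
    for feature in features:
        feature_users[feature].append(user)

def related_users(new_user_features):
    """Existing users that share at least one feature with a new user."""
    found = set()
    for feature in new_user_features:
        found.update(feature_users.get(feature, []))
    return sorted(found)

print(related_users(["device_B", "phone_139"]))  # ['user2', 'user3']
```

With the inverted table, finding the existing users linked to a new applicant is a dictionary lookup per feature rather than a scan over all users.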
-
-
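Example: statistical features of the user's area
-
Convert the GPS coordinates at application time into a geohash block
-
geohash: the basic idea is to treat the Earth as a two-dimensional plane and recursively split it into smaller blocks; all points within a block's latitude/longitude range share the same code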
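The recursive split can be written out directly. The function below is a minimal sketch of the standard geohash encoding (alternating longitude and latitude bisections, five bits per base-32 character); in practice an existing geohash library would normally be used instead.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bit_count = 0
    value = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1  # upper half: emit 1 and raise the lower bound
            rng[0] = mid
        else:
            value = value << 1        # lower half: emit 0 and lower the upper bound
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:            # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bit_count = 0
            value = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

A shorter code is always a prefix of a longer one for the same point, so the block size can be chosen simply by truncating the string.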
-
-
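For each block of suitable size, compute statistics of the credit scores of the people whose GPS position at application time fell in that block
-
When a new person applies, look up the average credit score of their block and use it as the GPS inverted-table feature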
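The block-level lookup can be sketched as a `groupby` followed by a keyed lookup. The geohash strings, scores, and the helper `gps_feature` are all made up for illustration.

```python
import pandas as pd

# Hypothetical existing users: geohash block at application time and credit score
existing = pd.DataFrame({
    "geohash": ["wx4g0", "wx4g0", "wx4g0", "wx4g1", "wx4g1"],
    "credit_score": [620, 650, 680, 540, 560],
})

# Inverted table keyed by block: average score and the number of users behind it
block_stats = existing.groupby("geohash")["credit_score"].agg(avg_score="mean", n_users="count")

def gps_feature(block, default=None):
    """Average credit score of the block, or default if no existing user is there."""
    if block in block_stats.index:
        return float(block_stats.loc[block, "avg_score"])
    return default

print(gps_feature("wx4g0"))  # 650.0
print(gps_feature("zzzzz"))  # None
```

Keeping `n_users` next to the average makes it possible to ignore blocks with too few existing users.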
-
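An inverted table consists of a join key plus statistical indicators
-
Join key: the data through which a new user is linked to the platform's existing users
-
Statistical indicator: the feature of existing users that is used to evaluate the new customer
-
-
Common statistical indicators
-
Common join keys
-
-
Feature requirements in credit business:
-
Simple logic
-
Easy to construct
-
Easy to debug
-
Strong business interpretability
-
-
Feature construction should look at data from two angles: induction + deduction
-
Induction: summarizing regularities from the results of large amounts of data (correlation)
-
Deduction: deriving necessary consequences from hypotheses (causation)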
-
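2.5 Summary
-
Preparation for feature engineering
-
ER diagram
-
Sample design table
-
Feature framework table
-
-
Feature construction methods
-
User static information features
-
User time cross-section features
-
User time series features
-
User association features
-
-
Missing value handling
-
Fill with zero
-
Risk trend
-
Add a missing-indicator feature
-
Business default value
-
-
Handling future information
-
Snapshot table
-
Split the data by whether it contains future information and handle each part separately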
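When the authorization state is stored with a timestamp, each loan can be joined to the state that was known at loan time. The sketch below uses `pandas.merge_asof` for that. The dates and column names are made up, and this is only one possible way to do a point-in-time join.

```python
import pandas as pd

# Hypothetical loans with their application time
loans = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "loan": ["l11", "l12", "l13", "l21", "l22", "l23"],
    "loan_time": pd.to_datetime(["2023-01-05", "2023-02-05", "2023-04-05",
                                 "2023-01-10", "2023-02-10", "2023-04-10"]),
})

# Hypothetical authorization records that keep the time of the change
auth_log = pd.DataFrame({
    "user": ["u1", "u2"],
    "auth_time": pd.to_datetime(["2023-03-01", "2023-03-02"]),
    "authorized": ["Y", "Y"],
})

# For each loan, take the latest record at or before loan_time (never a later one)
merged = pd.merge_asof(
    loans.sort_values("loan_time"),
    auth_log.sort_values("auth_time"),
    left_on="loan_time", right_on="auth_time",
    by="user", direction="backward",
)
# No earlier record means the user had not authorized yet
merged["authorized"] = merged["authorized"].fillna("N")
merged = merged.sort_values(["user", "loan_time"]).reset_index(drop=True)
print(merged[["user", "loan", "authorized"]])
```

Only the third loan of each user sees `Y`, whereas a plain join on `user` would mark all six loans as authorized and leak future information into the first two.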
-
-
Criteria for feature construction
-
Simple
-
Induction + deduction
-