Variable selection for logistic regression (LR)

Latest recommended article published on 2021-03-26 19:55:35

OverTheMoon



Article link: https://blog.csdn.net/qq_17377865/article/details/78853195


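This blog post discusses variable selection for logistic regression (LR) in Python. Because Python lacks a forward/backward stepwise method, the author suggests using RFE, setting its parameter to screen variables, and tuning it with grid search. The post also shares a custom stepwise function based on the AIC criterion, which ran into the problem that the AIC can become negative infinity. To work around this, the KS statistic is proposed as the selection criterion, since a larger KS value is better.

Summary generated automatically by CSDN.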

At the time of writing, Python (scikit-learn) had no built-in forward/backward stepwise selection method.
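For context, newer scikit-learn releases (0.24 and later) include SequentialFeatureSelector, which performs greedy forward or backward selection. It scores candidates by cross-validation, not by p-values or AIC, so it is not a drop-in replacement for classical stepwise regression. A minimal sketch, assuming synthetic data from make_classification and an arbitrary target of 3 features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely illustrative: 8 candidate features, 3 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

reg = LogisticRegression(max_iter=1000)

# Greedy forward selection scored by 3-fold cross-validated AUC.
# Use direction="backward" to start from all features and drop them instead.
sfs = SequentialFeatureSelector(reg, n_features_to_select=3,
                                direction="forward", scoring="roc_auc", cv=3)
sfs.fit(X, y)

selected = sfs.get_support(indices=True)  # column indices of the kept features
print(selected)
```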

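Using RFE (recursive feature elimination)

How it works: a parameter sets how many features to keep; at each step the least important features (ranked by the estimator's coefficients) are dropped until that many remain.

Reference: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

Tip: consider using grid search to tune the n_features_to_select parameter.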

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
reg = LogisticRegression(C=1, solver="newton-cg", max_iter = 1000, penalty = "l2")
model_select = RFE(estimator = reg, n_features_to_select = 4)
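The grid-search tip can be sketched as follows. This is one possible setup, not the author's code: RFE is wrapped in a Pipeline so that GridSearchCV can vary n_features_to_select. The synthetic data, the step names "select" and "clf", the candidate grid and the AUC scoring are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, purely illustrative: 8 candidate features, 3 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Step 1 selects features with RFE; step 2 refits LR on the selected columns.
pipe = Pipeline([
    ("select", RFE(estimator=LogisticRegression(C=1, solver="newton-cg", max_iter=1000))),
    ("clf", LogisticRegression(C=1, solver="newton-cg", max_iter=1000)),
])

# Candidate numbers of features to keep (hypothetical grid).
param_grid = {"select__n_features_to_select": [2, 3, 4, 5, 6]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_k = search.best_params_["select__n_features_to_select"]
kept = search.best_estimator_.named_steps["select"].support_  # boolean mask of kept columns
print(best_k, kept)
```

Putting the selector inside the pipeline means feature selection is redone within each cross-validation fold, which avoids leaking information from the validation fold into the selection step.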

I wrote my own stepwise function based on the AIC criterion:

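import math
# Least-squares form of AIC: 2k + n*ln(RSS/n), with k = X.shape[1] features and n = X.shape[0] samples.
# Residuals use the predicted probability of class 1 (column 1 of predict_proba), assuming y is coded 0/1.
AIC = lambda estimator, X, y: 2*X.shape[1] + X.shape[0]*math.log(pow((y-(estimator.predict_proba(X))[:,1]), 2).sum()/X.shape[0])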
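The expression above is the residual-sum-of-squares form of AIC, whose logarithm diverges as the residuals approach zero. An alternative, not the author's method, is the likelihood form AIC = 2k - 2 ln L, which is the more common definition for logistic regression. The sketch below assumes a fitted binary classifier with predict_proba and an intercept; the helper name aic_loglik is hypothetical. Because scikit-learn's LogisticRegression is L2-penalized by default, the fit is not a pure maximum-likelihood estimate, so the value is only an approximation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def aic_loglik(estimator, X, y):
    """Likelihood-based AIC (2k - 2 ln L) for a fitted binary classifier."""
    proba = estimator.predict_proba(X)
    # log_loss with normalize=False returns the summed negative log-likelihood;
    # it clips probabilities away from 0 and 1, so the result stays finite.
    neg_loglik = log_loss(y, proba, normalize=False, labels=estimator.classes_)
    k = X.shape[1] + 1  # coefficients plus intercept (assumes fit_intercept=True)
    return 2 * k + 2 * neg_loglik


# Illustrative usage on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
aic_value = aic_loglik(model, X, y)
print(aic_value)
```

As with any AIC, lower is better, and values are only comparable between models fitted on the same rows.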

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression


def stepwise_selection(X, y, initial_list=[], verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > thres