Interpretable Models - Logistic Regression, GLM, and GAM

1. Logistic Regression

The setup is the same as for linear regression; logistic regression simply passes the weighted sum of the features through the logistic function.

(1) Logistic function
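The logistic function maps any real number eta to the interval (0, 1): logistic(eta) = 1 / (1 + exp(-eta)). Below is a minimal sketch of how logistic regression applies it to the weighted sum; the weights and inputs are made up for illustration.

```python
import math

def logistic(eta: float) -> float:
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def predict_proba(x, weights, intercept):
    """P(y = 1 | x) for logistic regression: logistic(intercept + w . x)."""
    eta = intercept + sum(w * xi for w, xi in zip(weights, x))
    return logistic(eta)

# Hypothetical weights, chosen only for illustration.
p = predict_proba([2.0, -1.0], weights=[0.5, 1.0], intercept=0.0)
print(p)  # eta = 0.5 * 2 - 1 = 0, so p = 0.5
```

The function is symmetric around zero: logistic(eta) + logistic(-eta) = 1.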

(2) Interpretation
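The weights of a logistic regression are usually read through the odds. A short derivation, writing the weights as \beta_0, ..., \beta_p (a notation assumed here, since the notes do not fix one):

```latex
\begin{aligned}
P(y=1 \mid x) &= \frac{1}{1+\exp\!\big(-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\big)} \\
\text{odds}(x) = \frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)} &= \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p) \\
\frac{\text{odds}(x_1, \dots, x_j + 1, \dots, x_p)}{\text{odds}(x_1, \dots, x_j, \dots, x_p)} &= \exp(\beta_j)
\end{aligned}
```

So increasing x_j by one unit multiplies the odds by exp(\beta_j), with all other features held fixed.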

2. Generalized Models

The three assumptions of the linear model usually do not hold in practice.

Three assumptions of the linear model (left side): Gaussian distribution of the outcome given the features, additivity (= no interactions) and linear relationship. Reality usually does not adhere to those assumptions (right side): Outcomes might have non-Gaussian distributions, features might interact and the relationship might be nonlinear.

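The solutions to these three problems are as follows:

(1) y does not follow a Gaussian distribution

Solution: GLMs

(2) Feature interactions are ignored

Solution: Add interactions manually, because the plain linear model does not capture them on its own.

(3) The relationship between the features and the target is not linear

Solution: Generalized Additive Models (GAMs); transformation of features.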

3. GLM: Generalized Linear Models

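The GLM formula: g(E[y | x]) = β0 + β1·x1 + ... + βp·xp, where g is the link function and the right-hand side is the weighted sum of the features (the linear predictor).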

The core concept of any GLM is: Keep the weighted sum of the features, but allow non-Gaussian outcome distributions and connect the expected mean of this distribution and the weighted sum through a possibly nonlinear function.

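In other words, a GLM keeps the linear predictor, assumes the outcome follows a distribution from the exponential family, and uses a link function g to connect the expected value E[y | x] of that distribution to the linear predictor.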

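For example:

If y is a count of something, consider the Poisson distribution.

If y is a duration, which is always positive, consider the exponential distribution.

If y is binary and follows a Bernoulli distribution, the logit link gives logistic regression.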

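How do you find the right link function?

Combine knowledge of the target's distribution, theoretical considerations, and how well the model fits the actual data.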

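The book gives an example of predicting a number of coffees. Linear regression assumes a normally distributed target, so it can predict negative values, which makes no sense for a count. A GLM with a Poisson distribution and a log link avoids this.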
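The difference can be sketched on simulated counts. The data, the true weights, and the helper `fit_poisson_glm` below are assumptions for illustration, not the book's dataset or code; the fit uses iteratively reweighted least squares written out in NumPy so the log link is visible. In practice one would more likely reach for a library such as statsmodels or scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration, not the book's dataset):
# one feature x and a count outcome whose mean is exp(0.3 + 0.8 * x).
x = rng.uniform(-3.0, 3.0, size=500)
y = rng.poisson(np.exp(0.3 + 0.8 * x))
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature

# Ordinary linear regression (Gaussian outcome, identity link).
beta_lin, *_ = np.linalg.lstsq(X, y, rcond=None)

def fit_poisson_glm(X, y, n_iter=25):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start at the intercept-only model
    for _ in range(n_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # inverse link: mean = exp(eta)
        z = eta + (y - mu) / mu         # working response
        W = mu                          # working weights for Poisson + log link
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

beta_glm = fit_poisson_glm(X, y)

x_new = np.array([[1.0, -3.0]])         # a point at the low end of x
print("linear prediction:     ", (x_new @ beta_lin)[0])
print("Poisson GLM prediction:", np.exp(x_new @ beta_glm)[0])
```

Because the GLM prediction is exp(linear predictor), it is positive for any input, while the straight line fitted by linear regression dips below zero at the low end of x on this data.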

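How the interpretation differs from linear regression depends mostly on the link function. With logistic regression or a log link, for example, the effect of a feature is no longer additive but multiplicative (on the odds or on the expected value, respectively), because exp(a + b) is exp(a) times exp(b).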
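A small numeric check of the multiplicative reading, assuming a log link and made-up weights `b0`, `b1`, `b2` (hypothetical values, not taken from the book):

```python
import math

# Hypothetical log-link model: expected value = exp(b0 + b1 * x1 + b2 * x2).
b0, b1, b2 = 0.2, 0.5, -0.3

def expected_count(x1, x2):
    return math.exp(b0 + b1 * x1 + b2 * x2)

# Increase x1 by one unit at two different values of x2.
for x2 in (0.0, 2.0):
    before = expected_count(1.0, x2)
    after = expected_count(2.0, x2)
    print(f"x2={x2}: difference={after - before:.3f}, ratio={after / before:.3f}")

print(f"exp(b1)={math.exp(b1):.3f}")
```

The difference depends on where the other feature sits, but the ratio is always exp(b1). That is why weights under a log link are reported as factors rather than offsets.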

 

4. Feature Interactions

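An interaction term is interpreted differently from a single feature. Visualization helps here: split the samples by one of the two features, then plot the other feature on the horizontal axis against the target on the vertical axis.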
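A minimal sketch of adding an interaction by hand, on simulated data (the data-generating weights are assumptions for illustration). It fits a linear model with and without the product column `x1 * x2`, then splits the samples by `x2` and compares the slope of the target against `x1` in each group, which is the numeric counterpart of the plot described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (assumption): x2 is a binary feature that changes the slope of x1.
n = 400
x1 = rng.uniform(0.0, 10.0, size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2 + rng.normal(0.0, 1.0, size=n)

def fit_rmse(X, y):
    """Least-squares fit; returns the weights and the training RMSE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return beta, rmse

ones = np.ones(n)
_, rmse_main = fit_rmse(np.column_stack([ones, x1, x2]), y)
beta_int, rmse_int = fit_rmse(np.column_stack([ones, x1, x2, x1 * x2]), y)

print("RMSE without interaction:", rmse_main)
print("RMSE with interaction:   ", rmse_int)
print("weights with interaction:", beta_int)

# Split the samples by x2 and fit y against x1 in each group:
# different slopes per group are what an interaction looks like.
slopes = {}
for group in (0.0, 1.0):
    mask = x2 == group
    slope, intercept = np.polyfit(x1[mask], y[mask], deg=1)
    slopes[group] = slope
    print(f"x2 = {group:.0f}: slope of y vs x1 = {slope:.2f}")
```

The weight on the product column is the amount by which the slope of `x1` changes when `x2` goes from 0 to 1, so it cannot be read in isolation from the main effects.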

 

4. GAM: Generalized Additive Models - The world is not linear

For nonlinear relationships there are three common approaches:

• Simple transformation of the feature (e.g. logarithm)
• Categorization of the feature
• Generalized Additive Models (GAMs)

4.1 Feature transformation

Using a feature transformation means that you replace the column of this feature in the data with a function of the feature, such as the logarithm, and fit the linear model as usual.
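As a rough illustration, the sketch below simulates a target that depends on the logarithm of a feature (an assumption made for the example) and compares a linear fit on the raw column with a linear fit on the log-transformed column.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (assumption): the target depends on log(x), not on x itself.
x = rng.uniform(1.0, 100.0, size=300)
y = 5.0 + 3.0 * np.log(x) + rng.normal(0.0, 0.5, size=300)

def r_squared(feature, y):
    """R^2 of a simple linear regression of y on one feature column."""
    X = np.column_stack([np.ones_like(feature), feature])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    return 1.0 - residual.var() / y.var()

r2_raw = r_squared(x, y)
r2_log = r_squared(np.log(x), y)
print("R^2 with raw x: ", r2_raw)
print("R^2 with log(x):", r2_log)
```

The model is still linear in its weights, but the weight now describes a one-unit change in log(x), so the interpretation has to be stated on the transformed scale.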

4.2 Feature categorization

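Another option is to discretize the feature, turning it into a categorical feature so that the model estimates a separate effect for each bin. The problem with this approach is that it needs more data, it is more likely to overfit and it is unclear how to discretize the feature meaningfully (equidistant intervals or quantiles? how many intervals?). I would only use discretization if there is a very strong case for it. In practice, expect to try several discretization schemes, such as different numbers of bins; equal-frequency (quantile) binning is generally recommended over equal-width binning because it spreads the samples more evenly across the bins.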
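The two binning schemes can be compared on a skewed feature. This is a sketch on simulated data; the feature, the number of bins, and the helper `bin_counts` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated skewed feature (assumption), e.g. something income-like.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
n_bins = 5

# Equal-width bins: edges evenly spaced between min and max.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
# Equal-frequency bins: edges at quantiles of the data.
freq_edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))

def bin_counts(x, edges):
    """Assign each value to a bin using the interior edges, then count per bin."""
    bin_index = np.digitize(x, edges[1:-1])  # values in 0 .. n_bins - 1
    return np.bincount(bin_index, minlength=len(edges) - 1)

width_counts = bin_counts(x, width_edges)
freq_counts = bin_counts(x, freq_edges)
print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)

# One-hot encode the quantile bins; drop the first column as the reference
# category so the columns can sit next to an intercept in a linear model.
bin_index = np.digitize(x, freq_edges[1:-1])
one_hot = np.eye(n_bins)[bin_index][:, 1:]
print(one_hot.shape)  # (1000, 4)
```

With equal-width edges most samples of a skewed feature land in the first bin, leaving the upper bins with very few samples and therefore unstable weights. Quantile edges avoid that, at the cost of bins with very different widths.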

4.3 Generalized Additive Models (GAMs)

GAMs relax the restriction that the relationship must be a simple weighted sum, and instead assume that the outcome can be modeled by a sum of arbitrary functions of each feature.

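The nonlinear functions are typically built from splines (spline functions).

In other words, the prediction is obtained by adding up the outputs of the per-feature functions.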
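The additive structure can be sketched with a deliberately simplified stand-in for a GAM: each feature gets a linear-spline (hinge) basis, and all basis columns are fitted together by ordinary least squares. Real GAM software, such as the mgcv package in R, uses smoother spline bases plus a penalty on wiggliness; the unpenalized version, the simulated data, the knot positions, and the helper `hinge_basis` here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (assumption): an additive but nonlinear target.
n = 600
x1 = rng.uniform(-3.0, 3.0, size=n)
x2 = rng.uniform(0.0, 5.0, size=n)
y = np.sin(x1) + 0.5 * (x2 - 2.5) ** 2 + rng.normal(0.0, 0.2, size=n)

def hinge_basis(x, knots):
    """Linear-spline basis: x itself plus one hinge max(0, x - k) per knot."""
    return np.column_stack([x] + [np.maximum(0.0, x - k) for k in knots])

knots1 = np.linspace(-3.0, 3.0, 8)[1:-1]   # interior knots for x1
knots2 = np.linspace(0.0, 5.0, 8)[1:-1]    # interior knots for x2
B1, B2 = hinge_basis(x1, knots1), hinge_basis(x2, knots2)

# Additive model: y ~ intercept + f1(x1) + f2(x2), each f a linear spline.
X_gam = np.column_stack([np.ones(n), B1, B2])
beta_gam, *_ = np.linalg.lstsq(X_gam, y, rcond=None)

# Plain linear model for comparison.
X_lin = np.column_stack([np.ones(n), x1, x2])
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

def rmse(X, beta):
    return np.sqrt(np.mean((y - X @ beta) ** 2))

print("RMSE linear model:    ", rmse(X_lin, beta_lin))
print("RMSE additive splines:", rmse(X_gam, beta_gam))

# The fitted contribution of x1 alone, f1(x1), which is what one would plot.
f1 = B1 @ beta_gam[1 : 1 + B1.shape[1]]
```

Because the model is a sum of per-feature functions, each feature's effect can still be inspected on its own by plotting its fitted function, for example `f1` against `x1`.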

 

5. Further Extensions

5.1

Q:My data violates the assumption of being independent and identically distributed (iid). For example, repeated measurements on the same patient.
A:Search for mixed models or generalized estimating equations.

5.2

Q:My model has heteroscedastic errors.
For example, when predicting the value of a house, the model errors are usually higher in expensive houses, which violates the homoscedasticity of the linear model.
A:Search for robust regression.

5.3

Q:I have outliers that strongly influence my model.

A:Search for robust regression.

5.4

Q:I want to predict the time until an event occurs.
Time-to-event data usually comes with censored measurements, which means that for some instances there was not enough time to observe the event. For example, a company wants to predict the failure of its ice machines, but only has data for two years. Some machines are still intact after two years, but might fail later.
A:Search for parametric survival models, Cox regression, survival analysis.

5.5

Q:My outcome to predict is a category.
A:If the outcome has two categories, use a logistic regression model, which models the probability for the categories. If you have more categories, search for multinomial regression. Logistic regression and multinomial regression are both GLMs.

5.6

Q:I want to predict ordered categories. For example school grades.
A:Search for proportional odds model.

5.7

Q:My outcome is a count (like number of children in a family).
A:Search for Poisson regression. The Poisson model is also a GLM. If the count value of 0 is very frequent, search for zero-inflated Poisson regression or hurdle models.

5.8

Q:I am not sure what features need to be included in the model to draw correct causal conclusions. For example, I want to know the effect of a drug on the blood pressure. The drug has a direct effect on some blood value and this blood value affects the outcome. Should I include the blood value into the regression model?
A:Search for causal inference, mediation analysis.

5.9

Q:I have missing data.
A:Search for multiple imputation.

5.10

Q:I want to integrate prior knowledge into my models.

A:Search for Bayesian inference.

 

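This post uses the three assumptions of linear regression to motivate each of the extensions. The pointers in the last section work well as a reference; the book's author clearly put care into them.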
