Statsmodels raises "overflow in exp" and "divide by zero in log" warnings, and pseudo R-squared is -inf

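I want to run a logistic regression in Python using Statsmodels.

X and y each have 750 rows. y is the binary outcome, and X has 10 features (including the intercept).

Here are the first 12 rows of X (the last column is the intercept):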

      lngdp_      lnpop    sxp      sxp2    gy1    frac  etdo4590  geogia  \
0   7.367709  16.293980  0.190  0.036100 -1.682   132.0         1   0.916
1   7.509883  16.436258  0.193  0.037249  2.843   132.0         1   0.916
2   7.759187  16.589224  0.269  0.072361  4.986   132.0         1   0.916
3   7.922261  16.742384  0.368  0.135424  3.261   132.0         1   0.916
4   8.002359  16.901037  0.170  0.028900  1.602   132.0         1   0.916
5   7.929126  17.034786  0.179  0.032041 -1.465   132.0         1   0.916
6   6.594413  15.627563  0.360  0.129600 -9.321  4134.0         0   0.648
7   6.448889  16.037861  0.476  0.226576 -2.356  3822.0         0   0.648
8   8.520786  16.919334  0.048  0.002304  2.349   434.0         1   0.858
9   8.637107  16.991980  0.050  0.002500  2.326   434.0         1   0.858
10  8.708144  17.075489  0.042  0.001764  1.421   465.0         1   0.858
11  8.780480  17.151779  0.080  0.006400  1.447   496.0         1   0.858

    peace  intercept
0    24.0        1.0
1    84.0        1.0
2   144.0        1.0
3   204.0        1.0
4   264.0        1.0
5   324.0        1.0
6     1.0        1.0
7    16.0        1.0
8   112.0        1.0
9   172.0        1.0
10  232.0        1.0
11  292.0        1.0

Here is my code:

import statsmodels.api as sm

logit = sm.Logit(y, X, missing='drop')
result = logit.fit()
print(result.summary())

Here is the output:

Optimization terminated successfully.
         Current function value: inf
         Iterations 9

/home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1214: RuntimeWarning: overflow encountered in exp
  return 1/(1+np.exp(-X))
/home/ipattern/anaconda3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py:1264: RuntimeWarning: divide by zero encountered in log
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))

                           Logit Regression Results
==============================================================================
Dep. Variable:                  warsa   No. Observations:                  750
Model:                          Logit   Df Residuals:                      740
Method:                           MLE   Df Model:                            9
Date:                Tue, 12 Sep 2017   Pseudo R-squ.:                    -inf
Time:                        11:16:58   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                   -4.6237e+05
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
lngdp_        -0.9504      0.245     -3.872      0.000      -1.431      -0.469
lnpop          0.5105      0.128      3.975      0.000       0.259       0.762
sxp           16.7734      5.206      3.222      0.001       6.569      26.978
sxp2         -23.8004     10.040     -2.371      0.018     -43.478      -4.123
gy1           -0.0980      0.041     -2.362      0.018      -0.179      -0.017
frac          -0.0002    9.2e-05     -2.695      0.007      -0.000   -6.76e-05
etdo4590       0.4801      0.328      1.463      0.144      -0.163       1.124
geogia        -0.9919      0.909     -1.091      0.275      -2.774       0.790
peace         -0.0038      0.001     -3.808      0.000      -0.006      -0.002
intercept     -3.4375      2.486     -1.383      0.167      -8.310       1.435
==============================================================================

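The coefficients, std err, p-values, etc. at the bottom are correct (I know this because I have the "solution").

However, as you can see, Current function value is inf, which I think is wrong.

I get two warnings. Apparently statsmodels is evaluating np.exp(BIGNUMBER), e.g. np.exp(999), and np.log(0) somewhere.

Also, Pseudo R-squ. is -inf and Log-Likelihood is -inf, and I don't think either should be -inf.

So what am I doing wrong?

Edit:

X.describe():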

           lngdp_       lnpop         sxp        sxp2         gy1  \
count  750.000000  750.000000  750.000000  750.000000  750.000000
mean     7.766948   15.702191    0.155329    0.043837    1.529772
std      1.045121    1.645154    0.140486    0.082838    3.546621
min      5.402678   11.900227    0.002000    0.000004  -13.088000
25%      6.882694   14.723123    0.056000    0.003136   -0.411250
50%      7.696212   15.680984    0.111000    0.012321    1.801000
75%      8.669355   16.652981    0.203000    0.041209    3.625750
max      9.851826   20.908354    0.935000    0.874225   14.409000

              frac    etdo4590      geogia       peace  intercept
count   750.000000  750.000000  750.000000  750.000000      750.0
mean   1812.777333    0.437333    0.600263  348.209333        1.0
std    1982.106029    0.496388    0.209362  160.941996        0.0
min      12.000000    0.000000    0.000000    1.000000        1.0
25%     176.000000    0.000000    0.489250  232.000000        1.0
50%     864.000000    0.000000    0.608000  352.000000        1.0
75%    3375.000000    1.000000    0.763000  472.000000        1.0
max    6975.000000    1.000000    0.971000  592.000000        1.0

logit.loglikeobs(result.params):

array([-4.61803704e+01,-2.26983454e+02,-2.66741244e+02,-2.60206733e+02,-4.75585266e+02,-1.76454554e+00,-4.86048292e-01,-8.02300533e-01,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-6.02780923e+02,-4.12209348e+02,-6.42901288e+02,-6.94331125e+02,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,...

(logit.exog * np.array(result.params)).min(0):

array([-9.36347474,6.07506083,0.03354677,-20.80694575,-1.41162588,-1.72895247,0.,-0.9631801,-2.23188846,-3.4374963])

Dataset:

Solution

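I'm surprised that it still converges in this case.

Overflow in the exp function, which is used in Logit and Poisson, can cause convergence problems when the x values are large. This can often be avoided by rescaling the regressors.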
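To make the two warnings concrete, here is a minimal sketch that uses only NumPy and the same formula that appears in the first warning, 1/(1+np.exp(-X)). The input values are made up for illustration and are not taken from the dataset; the function name naive_logit_cdf is likewise hypothetical.

```python
import numpy as np

def naive_logit_cdf(z):
    # Same formula as in the warning message: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear predictor values; the last one is extreme on purpose
z = np.array([-5.0, -50.0, -800.0])

# Silence the RuntimeWarnings here so the effect can be inspected directly
with np.errstate(over="ignore", divide="ignore"):
    p = naive_logit_cdf(z)   # exp(800) overflows to inf, so p is exactly 0.0
    loglike = np.log(p)      # log(0.0) evaluates to -inf

print(p)
print(loglike)
```

A single observation with a probability that underflows to zero is enough to turn the summed log-likelihood into -inf.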

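However, in this case my guess would be outliers in x. The sixth column has values like 4134.0, while the other columns are much smaller.

You could check the log-likelihood of each observation with logit.loglikeobs(result.params) to see which observations might be causing problems, where logit is the name that references the model.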
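As a rough sketch of that check, the snippet below picks out the rows whose log-likelihood contribution is not finite. To keep it self-contained, llobs is filled with the first ten values posted in the question; with a fitted model it would come from logit.loglikeobs(result.params) instead.

```python
import numpy as np

# First ten entries of the loglikeobs output posted in the question.
# With a fitted model: llobs = logit.loglikeobs(result.params)
llobs = np.array([-4.61803704e+01, -2.26983454e+02, -2.66741244e+02,
                  -2.60206733e+02, -4.75585266e+02, -1.76454554e+00,
                  -4.86048292e-01, -8.02300533e-01, -np.inf, -np.inf])

# Row positions whose contribution is -inf (or nan)
bad_rows = np.flatnonzero(~np.isfinite(llobs))

# Among the finite contributions, the row that fits worst
worst_finite = int(np.where(np.isfinite(llobs), llobs, np.inf).argmin())

print("non-finite rows:", bad_rows)
print("worst finite row:", worst_finite)
```

The positions in bad_rows can then be used to look at the corresponding rows of X and y.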

The contribution of each predictor might also help, for example

np.argmax(np.abs(logit.exog * result.params), 0)

or

(logit.exog * result.params).min(0)
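The sketch below shows what these two expressions compute, using rows 0 and 6 of the X sample and the coefficients from the summary table in the question. The coefficients are the rounded values as printed (the frac coefficient in particular is rounded to one significant digit), so the numbers are only indicative.

```python
import numpy as np

cols = ["lngdp_", "lnpop", "sxp", "sxp2", "gy1",
        "frac", "etdo4590", "geogia", "peace", "intercept"]

# Rounded coefficients copied from the summary table
params = np.array([-0.9504, 0.5105, 16.7734, -23.8004, -0.0980,
                   -0.0002, 0.4801, -0.9919, -0.0038, -3.4375])

# Rows 0 and 6 of the X sample shown in the question
exog = np.array([
    [7.367709, 16.293980, 0.190, 0.036100, -1.682, 132.0, 1, 0.916, 24.0, 1.0],
    [6.594413, 15.627563, 0.360, 0.129600, -9.321, 4134.0, 0, 0.648, 1.0, 1.0],
])

# Broadcasting multiplies each column by its coefficient
contrib = exog * params
linpred = contrib.sum(axis=1)               # same as exog @ params

row_of_max = np.abs(contrib).argmax(0)      # per predictor: row with largest |contribution|
min_contrib = contrib.min(0)                # per predictor: most negative contribution
largest_per_row = [cols[j] for j in np.abs(contrib).argmax(1)]

print(dict(zip(cols, row_of_max)))
print(dict(zip(cols, min_contrib.round(3))))
print(largest_per_row)
```

On the full dataset, logit.exog and result.params would replace the hard-coded arrays.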

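If it's just one or a few observations, then dropping them might help. Rescaling the exog most likely won't help here, because at convergence it will simply be compensated by a rescaling of the estimated coefficients.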
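The compensation argument can be checked numerically. In this sketch the data, the coefficients, and the scale factors are all made up; it only shows that dividing a column by a constant and multiplying its coefficient by the same constant leaves the linear predictor, and therefore the predicted probabilities, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))             # made-up regressors
beta = np.array([0.5, -1.0, 2.0])       # made-up coefficients

scale = np.array([1.0, 1000.0, 0.01])   # hypothetical rescaling factors
X_scaled = X / scale                    # rescaled regressors
beta_scaled = beta * scale              # coefficients that compensate

same = np.allclose(X @ beta, X_scaled @ beta_scaled)
print(same)
```

Rescaling can still help the optimizer along the way, but the extreme values of the linear predictor at the solution stay the same.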

Also check whether there is an encoding error or a large value used as a placeholder for missing values.
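One possible way to run such a check is sketched below. The helper name sanity_check and the sentinel values (-999, 999, 9999) are assumptions chosen for illustration; the placeholders that matter depend on how the data was prepared. The check on y reflects that the question describes y as a binary outcome.

```python
import numpy as np

def sanity_check(y, X, sentinels=(-999, 999, 9999)):
    """Hypothetical helper: flag non-binary outcomes and suspicious placeholders."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    y_obs = y[~np.isnan(y)]
    return {
        "y_values": np.unique(y_obs),                        # distinct outcome codes
        "y_is_binary": bool(np.isin(y_obs, [0, 1]).all()),   # only 0 and 1 present?
        "n_missing": int(np.isnan(X).sum() + np.isnan(y).sum()),
        "n_sentinel": int(np.isin(X, sentinels).sum()),      # placeholder-like values
    }

# Made-up example data with one bad outcome code, one NaN and one sentinel
y_demo = [0, 1, 1, 4]
X_demo = [[1.0, 999], [2.0, 3.0], [np.nan, 1.0], [0.5, 0.2]]
report = sanity_check(y_demo, X_demo)
print(report)
```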

edit

Given that the number of -inf in loglikeobs seems to be large, I think that there might be a more fundamental problem than outliers, in the sense that the Logit model is not the correctly specified maximum likelihood model for this dataset.

Two possibilities in general (because I haven't seen the dataset):

Perfect separation: Logit assumes that the predicted probabilities stay away from zero and one. In some cases an explanatory variable or a combination of them allows perfect prediction of the dependent variable. In this case the parameters are either not identified or go to plus or minus infinity. The actual parameter estimates then depend on the convergence criteria of the optimization. Statsmodels Logit detects some of these cases and then raises a PerfectSeparation exception, but it doesn't detect all cases with partial separation.
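A toy illustration of perfect separation, with made-up data that is unrelated to the question's dataset: the outcome is 1 exactly when x is positive, so the log-likelihood keeps increasing toward 0 as the coefficient grows and no finite maximizer exists. The log-likelihood is written with np.logaddexp to avoid overflow; that is a choice made for this sketch, not a description of the statsmodels internals.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])            # y == 1 exactly when x > 0

def loglike(beta):
    # Logit log-likelihood: sum of log(cdf(q * x * beta)) with q = 2*y - 1,
    # where log(cdf(t)) = -log(1 + exp(-t)) = -logaddexp(0, -t)
    q = 2 * y - 1
    return -np.logaddexp(0.0, -q * beta * x).sum()

values = [loglike(b) for b in (1.0, 10.0, 100.0)]
print(values)   # increases toward 0 as beta grows
```

An optimizer on such data stops wherever its tolerance is met, which is why the reported estimates depend on the convergence criteria.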

Logit or GLM-Binomial are in the one parameter linear exponential family. The parameter estimates in this case only depend on the specified mean function and the implied variance. It does not require that the likelihood function is correctly specified. So it is possible to get good (consistent) estimates even if the likelihood function is not correct for the given dataset. In this case the solution is a quasi-maximum likelihood estimator, but the loglikelihood value is invalid.

This can have the effect that the results in terms of convergence and numerical stability depend on the computational details for how the edge or extreme cases are handled. Statsmodels is clipping the values to keep them away from the bounds in some cases but not yet everywhere.
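How much the handling of extreme cases matters can be seen by comparing two ways of computing the same log-likelihood. The naive version follows the formulas shown in the warning messages; the stable version uses np.logaddexp and is only one possible alternative, not what statsmodels does internally. The inputs are made up.

```python
import numpy as np

def loglike_naive(z, y):
    # log(1 / (1 + exp(-q*z))), as in the formulas from the warnings
    q = 2 * y - 1
    with np.errstate(over="ignore", divide="ignore"):
        return np.log(1.0 / (1.0 + np.exp(-q * z))).sum()

def loglike_stable(z, y):
    # Same quantity, rewritten as -log(1 + exp(-q*z)) via logaddexp
    q = 2 * y - 1
    return -np.logaddexp(0.0, -q * z).sum()

z = np.array([2.0, -800.0])   # made-up linear predictor, one extreme value
y = np.array([1, 1])

naive = loglike_naive(z, y)     # -inf: exp(800) overflows
stable = loglike_stable(z, y)   # finite, roughly -800.13

print(naive, stable)
```

The stable value is finite but still enormous in magnitude, so a finite number here does not by itself mean the model fits that observation.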

The difficulty is in figuring out what to do about numerical problems and to avoid returning "some" numbers without warning the user when the underlying model is inappropriate for or incompatible with the data.

Maybe llf = -inf is the "correct" answer in this case, and any finite numbers are just approximations of -inf. Or maybe it's just a numerical problem caused by the way the functions are implemented in double precision.