Why are the logistic regression results different between Python's statsmodels and R?

I am trying to compare the logistic regression implementations in Python's statsmodels and in R.

import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np

df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df.columns = list(df.columns)[:3] + ["prestige"]  # rename the "rank" column

# df.hist()
# pl.show()

# dummy-code prestige, keeping prestige_2..prestige_4 (prestige_1 is the baseline)
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige")
cols_to_keep = ["admit", "gre", "gpa"]
data = df[cols_to_keep].join(dummy_ranks.loc[:, "prestige_2":])  # .ix is removed in modern pandas
data["intercept"] = 1.0

train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()

Result:

                        Results: Logit
=================================================================
Model:              Logit             Pseudo R-squared: 0.083
Dependent Variable: admit             AIC:              470.5175
Date:               2014-12-19 01:11  BIC:              494.4663
No. Observations:   400               Log-Likelihood:   -229.26
Df Model:           5                 LL-Null:          -249.99
Df Residuals:       394               LLR p-value:      7.5782e-08
Converged:          1.0000            Scale:            1.0000
No. Iterations:     6.0000
------------------------------------------------------------------
              Coef.   Std.Err.     z      P>|z|   [0.025   0.975]
------------------------------------------------------------------
gre           0.0023    0.0011   2.0699  0.0385   0.0001   0.0044
gpa           0.8040    0.3318   2.4231  0.0154   0.1537   1.4544
prestige_2   -0.6754    0.3165  -2.1342  0.0328  -1.2958  -0.0551
prestige_3   -1.3402    0.3453  -3.8812  0.0001  -2.0170  -0.6634
prestige_4   -1.5515    0.4178  -3.7131  0.0002  -2.3704  -0.7325
intercept    -3.9900    1.1400  -3.5001  0.0005  -6.2242  -1.7557
=================================================================
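Incidentally, statsmodels also has an R-style formula interface (statsmodels.formula.api), which adds the intercept and dummy-codes categorical variables automatically, much like R's glm(). A minimal sketch of the same fit, assuming the df with the renamed prestige column from above:

import statsmodels.formula.api as smf

# C(prestige) dummy-codes prestige with its first level as the baseline;
# the intercept is added automatically, as in R.
result = smf.logit("admit ~ gre + gpa + C(prestige)", data=df).fit()
print(result.summary2())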

The R version:

data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]  # drop rank1 as the baseline
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))

Result:

Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
    data = data1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5133  -0.8661  -0.6573   1.1808   2.0629

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029   1.162421  -3.599 0.000319 ***
gre          0.002358   0.001112   2.121 0.033954 *
gpa          0.770591   0.343908   2.241 0.025046 *
rank2       -0.369711   0.310342  -1.191 0.233535
rank3       -1.015012   0.335147  -3.029 0.002457 **
rank4       -1.249251   0.414416  -3.014 0.002574 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 466.13  on 377  degrees of freedom
Residual deviance: 434.12  on 372  degrees of freedom
AIC: 446.12

Number of Fisher Scoring iterations: 4

The results are completely different: for example, the p-value for rank_2 (prestige_2 in the Python output) is 0.03 in one and 0.23 in the other. What is the reason for this difference? Note that I created dummy variables in both versions, and added a constant column for the Python version, which R handles automatically.
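One thing worth checking before comparing coefficients (a sanity check, not a full diagnosis): the two summaries do not even report the same sample size. statsmodels shows "No. Observations: 400", while R reports the null deviance on 377 degrees of freedom, which implies only 378 rows survived the dcast() reshaping (df = n - 1). A quick sketch to confirm the count on the Python side, reusing the objects defined above:

# Python side: statsmodels sees all 400 rows
print(len(data))                         # 400
print(int(data["admit"].notna().sum()))  # no rows dropped for missing values

The corresponding check in R is nrow(data1); if the counts differ, the two fits are not run on the same data, which by itself would change every coefficient and p-value.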

Also, Python appears to be nearly twice as fast:

##################################################
# python timing
import time

def test():
    for i in range(5000):
        logit = sm.Logit(data["admit"], data[train_cols])
        result = logit.fit(disp=0)

start = time.time()
test()
print(time.time() - start)

10.099738836288452

##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())

   user  system elapsed
 17.505   0.021  17.526
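As an aside, the standard-library timeit module is a more idiomatic way to benchmark the Python fit than manual time.time() bookkeeping; a small sketch, reusing data and train_cols from above:

import timeit

# Re-fit the same model 5000 times; disp=0 silences the per-fit output.
elapsed = timeit.timeit(
    lambda: sm.Logit(data["admit"], data[train_cols]).fit(disp=0),
    number=5000,
)
print(elapsed)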
