Logistic Regression with Python

The Likelihood Perspective of Logistic Regression

We have i.i.d. data $(x_i, y_i)_{i=1}^n$, where $y_i$ is a binary outcome variable. Treat $x_i$ as fixed; below I write $\cdot|x$ to indicate that $x$ carries no randomness. Since $y$ is a 0-1 binary variable, I assume the following model:

$$y_i|x_i \sim \operatorname{Bernoulli}(f(x_i)).$$

It follows that $f(x_i)=\mathrm{E}[y_i|x_i]=\operatorname{pr}(y_i=1|x_i)$. Taking $f(x)=\sigma(x^\tau\beta+\alpha)=\operatorname{expit}(x^\tau\beta+\alpha)=\frac{1}{1+\exp(-x^\tau\beta-\alpha)}$ gives the Logistic Model, while $f(x)=\Phi(x^\tau\beta+\alpha)=\int_{-\infty}^{x^\tau\beta+\alpha}\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}s^2)\,ds$ gives the Probit Model. For brevity, write $w=x^\tau\beta+\alpha$. Under the Bernoulli model above, the probability mass function of the random outcome $y_i$ is $g(t|x_i)=\operatorname{pr}(y_i=t|x_i)=f(x_i)^{t}(1-f(x_i))^{1-t}$. The likelihood of $f$ is therefore
$$L_n(f;y_1,\cdots,y_n|x_1,\cdots,x_n)=\prod_{i=1}^n g(y_i|x_i)=\prod_{i=1}^n f(x_i)^{y_i}(1-f(x_i))^{1-y_i}.$$

The log-likelihood is

$$\log L_n(f)=\sum_{i=1}^n \left\{y_i\log\frac{f(x_i)}{1-f(x_i)}+\log(1-f(x_i))\right\}.$$

The choice of $f$ must satisfy $f\in[0,1]$, as in the Logistic and Probit Models above. Under the logistic model, the log-likelihood becomes

$$\log L_n(\beta,\alpha)=\sum_{i=1}^n \left\{y_i w_i-\log(1+\exp(w_i))\right\}.$$

Under the Logistic Model, the maximum likelihood estimator of $(\beta,\alpha)$ is
$$(\hat{\beta},\hat{\alpha})=\arg\max_{\beta,\alpha}\log L_n(\beta,\alpha).$$

Written as a minimization problem:

$$\arg\min_{\beta,\alpha}\;-\log L_n(\beta,\alpha).$$

The Hessian matrix of the objective function with respect to $\beta$ is

$$H_{\beta}=-\frac{\partial^2\log L_n(\beta,\alpha)}{\partial\beta\,\partial\beta^\tau}=\sum_{i=1}^n \sigma(w_i)\{1-\sigma(w_i)\}\,x_i x_i^\tau.$$

$H_{\beta}$ is positive semidefinite (and positive definite whenever the $x_i$ span the covariate space), so $-\log L_n(\beta,\alpha)$ is convex in $\beta$, and the Newton-Raphson algorithm can be used to find the global minimizer. The intercept $\alpha$ can be handled in the same way; alternatively, set $\tilde{x}=(1,x)$ and $\gamma=(\alpha,\beta)$ so that $w=\tilde{x}^\tau\gamma$, and solve for $\gamma$ directly. The gradient of $\log L_n$ with respect to $\beta$ is

$$\frac{\partial}{\partial\beta}\log L_n(\beta,\alpha)=\sum_{i=1}^n\left\{y_i-\frac{1}{1+\exp(-w_i)}\right\}x_i.$$
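As a quick sanity check on these formulas, here is a minimal numpy sketch (the function names neg_log_lik and grad_neg_log_lik are mine, not from any library) that evaluates $-\log L_n(\gamma)$ and its gradient with $\tilde{x}=(1,x)$ and $\gamma=(\alpha,\beta)$; np.logaddexp(0, w) computes $\log(1+\exp(w))$ in a numerically stable way:

import numpy as np

def neg_log_lik(gamma, X, y):
    # -log L_n(gamma) for the logistic model; X already contains a leading column of ones
    w = X @ gamma
    return np.sum(np.logaddexp(0.0, w) - y * w)

def grad_neg_log_lik(gamma, X, y):
    # gradient of -log L_n(gamma): sum_i (sigma(w_i) - y_i) x_i
    w = X @ gamma
    p = 1.0 / (1.0 + np.exp(-w))
    return X.T @ (p - y)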

The Newton-Raphson Algorithm

Let $\tilde{x}=(1,x)$ and $\gamma=(\alpha,\beta)$.

Step 1: Set the initial values: $k=0$, $\hat{\gamma}_{old}=\gamma^{(0)}$, a tolerance $\epsilon$, and a maximum number of iterations $N_{max}$.

Step 2: Update the parameter: $\hat{\gamma}_{new}=\hat{\gamma}_{old}+H_{\gamma}(\hat{\gamma}_{old})^{-1}\frac{\partial}{\partial\gamma}\log L_n(\hat{\gamma}_{old})$.

Step 3: Check whether $\|\frac{\partial}{\partial\gamma}\log L_n(\hat{\gamma}_{new})\|<\epsilon$ (or $\|\hat{\gamma}_{new}-\hat{\gamma}_{old}\|<\epsilon$) or $k\geq N_{max}$. If so, stop and output $\hat{\gamma}_{new}$; otherwise set $k=k+1$, $\hat{\gamma}_{old}=\hat{\gamma}_{new}$, and return to Step 2. A minimal implementation sketch follows.
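Below is a minimal self-contained numpy sketch of this Newton-Raphson loop (the function name newton_raphson_logistic is mine); it assumes the design matrix X already contains a leading column of ones, i.e. its rows are $\tilde{x}_i^\tau$:

import numpy as np

def newton_raphson_logistic(X, y, eps=1e-8, n_max=100):
    # X: (n, p+1) design matrix whose first column is all ones; y: (n,) 0/1 labels
    gamma = np.zeros(X.shape[1])                    # gamma^(0)
    for k in range(n_max):
        w = X @ gamma
        p = 1.0 / (1.0 + np.exp(-w))                # sigma(w_i)
        score = X.T @ (y - p)                       # gradient of log L_n
        if np.linalg.norm(score) < eps:             # Step 3 stopping rule
            break
        H = X.T @ (X * (p * (1.0 - p))[:, None])    # H_gamma = sum_i sigma(1-sigma) x_i x_i^tau
        gamma = gamma + np.linalg.solve(H, score)   # Step 2 update
    return gamma

In practice one would add safeguards such as step halving, or a small ridge term when $H_\gamma$ is near-singular, but the loop above is exactly the update in Step 2.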

Prediction

Once we have the estimate $\hat{\gamma}_n$, we also have an estimate of $f$: $\hat{f}_n(x)=\sigma(\tilde{x}^\tau\hat{\gamma}_n)$. For a new, unlabeled observation $x_{new}$, we can then compute $\hat{\operatorname{pr}}(y_{new}=1|x_{new})=\hat{f}_n(x_{new})$. If $\hat{\operatorname{pr}}(y_{new}=1|x_{new})>0.5$, we judge $y_{new}=1$ to be the more likely outcome, so the prediction is $\hat{y}_{new}=1$. That is,

$$\hat{y}(x)=1_{\hat{f}_n(x)>0.5}.$$
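A minimal sketch of this plug-in classification rule (the helper name predict_label is mine), assuming gamma_hat comes from, e.g., the Newton-Raphson sketch above:

import numpy as np

def predict_label(x_new, gamma_hat):
    # returns 1{f_hat(x) > 0.5} for each row of x_new (covariates without the intercept column)
    x_tilde = np.column_stack([np.ones(len(x_new)), np.asarray(x_new)])  # prepend 1 for alpha
    prob = 1.0 / (1.0 + np.exp(-x_tilde @ gamma_hat))                    # f_hat(x)
    return (prob > 0.5).astype(int)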

Large-Sample Properties

Suppose the model assumptions above are correct, i.e. $y$ really is generated from $\operatorname{Bernoulli}(\sigma(\tilde{x}^\tau\gamma_0))$ with true parameter $\gamma_0$. Suppose further that the $x_i$ are themselves random, drawn i.i.d. from the distribution of a random variable $X$. Statisticians typically care about the asymptotic properties of $\hat{\gamma}$, whereas data scientists care more about the generalization error of the classifier, i.e.,

$$R(\hat{f}_n)=\operatorname{pr}\big(\hat{y}(X)\neq 1_{\sigma(\tilde{X}^\tau\gamma_0)>0.5}\big), \qquad \tilde{X}=(1,X).$$

In general, one hopes that $R(\hat{f}_n)<\delta$ holds with high probability once the sample size is large enough; in other words, one asks whether an inequality of the form $\operatorname{pr}(R(\hat{f}_n)<\delta(n))>1-\epsilon(n)$ can be established.
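As an illustration (not part of the original analysis), here is a small simulation sketch that estimates $R(\hat{f}_n)$ by Monte Carlo: draw data from a known $\gamma_0$, fit a logistic regression, and measure how often $\hat{y}$ disagrees with the Bayes rule $1_{\sigma(\tilde{X}^\tau\gamma_0)>0.5}$ on a large fresh sample. All names, parameter values, and sample sizes here are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
gamma_0 = np.array([0.5, 1.0, -1.0])        # true (alpha, beta_1, beta_2), chosen for illustration

def draw(n):
    x = rng.normal(size=(n, 2))
    prob = 1.0 / (1.0 + np.exp(-(gamma_0[0] + x @ gamma_0[1:])))
    return x, rng.binomial(1, prob), prob

# fit on a training sample
x_tr, y_tr, _ = draw(500)
clf = LogisticRegression(C=1e6).fit(x_tr, y_tr)   # very large C, i.e. essentially no regularization

# estimate R(f_hat_n): disagreement with the Bayes rule 1{sigma(x~' gamma_0) > 0.5}
x_te, _, prob_te = draw(100000)
bayes_label = (prob_te > 0.5).astype(int)
print(np.mean(clf.predict(x_te) != bayes_label))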

Multi-class Logistic Regression

When the outcome variable has more than two classes (say $K$ classes, labeled $1,2,\cdots,K$), we extend the model to

$$g(t|x)=\operatorname{pr}(y=t|x)=\prod_{i=1}^K f_i(x)^{1_{i=t}}, \qquad t=1,\cdots,K,$$

where $f_i(x)=\frac{\exp(\tilde{x}^\tau\gamma_i)}{\sum_{j=1}^K\exp(\tilde{x}^\tau\gamma_j)}$. The likelihood is then

$$L_n(\gamma_1,\cdots,\gamma_K)=\prod_{i=1}^n g(y_i|x_i)=\prod_{i=1}^n\prod_{k=1}^K f_k(x_i)^{1_{y_i=k}}.$$

The log-likelihood is

$$\log L_n=\sum_{i=1}^n\sum_{k=1}^K 1_{y_i=k}\cdot\log f_k(x_i).$$

We fit the model by maximizing the log-likelihood, i.e.,

$$(\hat{\gamma}_1,\cdots,\hat{\gamma}_K)=\arg\min_{\gamma_1,\cdots,\gamma_K}\;-\log L_n(\gamma_1,\cdots,\gamma_K).$$

As before, $-\log L_n(\gamma_1,\cdots,\gamma_K)$ is convex in the parameters (strictly convex only after fixing one class as the reference, e.g. $\gamma_K=0$, since adding a common vector to every $\gamma_k$ leaves the probabilities unchanged), and the optimization problem can again be solved by Newton's method.
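A minimal numpy sketch of the multinomial negative log-likelihood and its gradient (the function name softmax_nll_grad and the choice to stack the parameters as a (p+1, K) matrix Gamma are mine, not from the text); for convenience the labels are coded 0, ..., K-1 here:

import numpy as np

def softmax_nll_grad(Gamma, X, y, K):
    # X: (n, p+1) with a leading column of ones; y: (n,) integer labels in {0, ..., K-1}
    scores = X @ Gamma                              # x_tilde^tau gamma_k for every class k
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)               # f_k(x_i)
    Y = np.eye(K)[y]                                # one-hot indicators 1{y_i = k}
    nll = -np.sum(Y * np.log(P + 1e-12))            # -log L_n
    grad = X.T @ (P - Y)                            # d(-log L_n) / d Gamma, shape (p+1, K)
    return nll, grad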

Real Data Study: the Speed Dating Dataset

Data Description

The Speed Dating dataset can be downloaded from the linked site, and the article by Morgan et al. (2014)[1] studies it in detail. The data were collected in 2002-2004 in an experiment designed to study what attracts people to the opposite sex. The subjects were Columbia University students, 276 men and 276 women. In each round, every participant was randomly paired with a member of the opposite sex for a four-minute date, after which the two filled out questionnaires rating each other; 10-20 rounds were run in total. We only use the data from the first round, so there are 276 observations, each representing one four-minute date between a man and a woman. Since the two people involved differ across observations, the observations are independent. Morgan et al. (2014)[1] summarize the variables in the dataset:

Covariate | Description
Decision | 1 = Yes (want to see the date again), 0 = No (do not want to see the date again)
Like | Overall, how much do you like this person? (1 = don't like at all, 10 = like a lot)
PartnerYes | How probable do you think it is that this person will say 'yes' for you? (1 = not probable, 10 = extremely probable)
Age | Age
Race | Race (Caucasian, Asian, Black, Latino, or Other)
Attractive | Rate attractiveness of partner on a scale of 1-10 (1 = awful, 10 = great)
Sincere | Rate sincerity of partner on a scale of 1-10 (1 = awful, 10 = great)
Intelligent | Rate intelligence of partner on a scale of 1-10 (1 = awful, 10 = great)
Fun | Rate how fun partner is on a scale of 1-10 (1 = awful, 10 = great)
Ambitious | Rate ambition of partner on a scale of 1-10 (1 = awful, 10 = great)
Shared Interests | Rate the extent to which you share interests/hobbies with partner on a scale of 1-10 (1 = awful, 10 = great)

The variable Decision indicates whether the rater is willing to go on another date with this partner; it measures how satisfied each person was with the other, so we can treat it as an indicator of attractiveness. For example, DecisionF (the woman's decision) can be viewed as a measure of the man's attractiveness. The remaining variables record each person's ratings of the other on specific traits. For example, FunF (the woman's rating of how fun the man is) measures how fun that man was in that observation, and SharedInterestsF measures the extent to which the woman felt she shared interests with him on that date.

Loading the Data (with Python)

Load the required libraries:

# libraries #
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import dexplot 
import itertools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

Read the data with pandas' read_csv:

# load data 
df_speed = pd.read_csv("SpeedDating.csv")

Data Visualization (with Python)

Take a look at the first 10 rows of the data:

# a glimpse of data 
df_speed.head(10)

[Figure: the first 10 rows of df_speed]
Check some simple summary statistics:

# simple statistics summary 
df_speed.describe()

[Figure: output of df_speed.describe()]
Some variables have missing values; we will discuss how to handle them when we get to classification. For now, let us draw boxplots grouped by DecisionF along the horizontal axis, to compare the distributions of the other variables between DecisionF == 1 and DecisionF == 0:

# Boxplots of selected columns split by the females' decision (DecisionF)
fig, axs = plt.subplots(2,4,figsize=(20,  10), sharey=False, sharex = True)
df_speed.boxplot(column = ['LikeF','PartnerYesF','AgeM','AttractiveF','SincereF','IntelligentF','FunF','AmbitiousF'], 
                               by = 'DecisionF', ax = axs)

[Figure: boxplots of LikeF, PartnerYesF, AgeM, AttractiveF, SincereF, IntelligentF, FunF, AmbitiousF grouped by DecisionF]
We can see that in the group willing to date again, the men are rated as more attractive (AttractiveF) and more fun (FunF), and the women like them more (LikeF), while other variables such as age and sincerity show no obvious difference. To some extent this suggests that attractiveness and being fun are important components of attraction. Similarly, we can look at the men's ratings of the women:

# Boxplots of selected columns split by the males' decision (DecisionM)
fig, axs = plt.subplots(2,4,figsize=(20,  10), sharey=False, sharex = True)
df_speed.boxplot(column = ['LikeM','PartnerYesM','AgeF','AttractiveM','SincereM','IntelligentM','FunM','AmbitiousM'], 
                               by = 'DecisionM', ax = axs)

[Figure: boxplots of LikeM, PartnerYesM, AgeF, AttractiveM, SincereM, IntelligentM, FunM, AmbitiousM grouped by DecisionM]
Likewise, the woman's attractiveness strongly influences whether the man is willing to date again. Next, let us see whether the proportion of DecisionF == 1 differs across the man's race:

# Proportion of DecisionF == 1 within each category of RaceM
Inst_variable = 'RaceM'
decision_counts = (df_speed.groupby(['DecisionF'])[Inst_variable].value_counts())
c = decision_counts.unstack()              # rows: DecisionF (0/1), columns: RaceM categories
c.fillna(0,inplace=True)
round(c.iloc[1]/(c.iloc[1]+c.iloc[0]),2)   # share of "yes" decisions per race

[Figure: proportion of DecisionF == 1 by RaceM]
The approval rate for Caucasian men is higher than for the other groups; a bar chart makes this more intuitive:

# Bar plot of RaceM counts split by DecisionF
dexplot.count(Inst_variable, data=df_speed, split='DecisionF')

[Figure: counts of DecisionF by RaceM]
The rates for Asian and Latino men are somewhat lower. In the same way, we can look at the approval rate at each level of FunF:

Inst_variable = 'FunF'
# Proportion of DecisionF == 1 at each level of FunF
decision_counts = (df_speed.groupby(['DecisionF'])[Inst_variable].value_counts())
c = decision_counts.unstack()
c.fillna(0,inplace=True)
round(c.iloc[1]/(c.iloc[1]+c.iloc[0]),2)
# Stacked bar plot of FunF counts split by DecisionF
dexplot.count(Inst_variable, data=df_speed, split='DecisionF', stacked = True)

FunF    Ratio
1.0     0.00
2.0     0.00
3.0     0.09
4.0     0.17
5.0     0.29
6.0     0.35
7.0     0.60
8.0     0.59
9.0     0.65
10.0    0.89
dtype: float64

[Figure: stacked bar chart of DecisionF counts by FunF]
The more fun the man is, the higher the probability that the woman agrees to a second date. Now let us draw a scatter plot of (FunF, AttractiveF). Method 1:

dexplot.scatter(data = df_speed, x = "FunF", y = "AttractiveF" , split = "DecisionF")

[Figure: dexplot scatter of FunF vs AttractiveF split by DecisionF]
Method 2, with pyplot:

# plot by pyplot
fig, ax = plt.subplots(figsize=(10, 10))
plt.rcParams['font.size'] = '20'
ax.scatter(df_speed["FunF"][df_speed.DecisionF==1],df_speed["AttractiveF"][df_speed.DecisionF==1],
           marker = 'D', color = 'red', label = 'Yes')
ax.scatter(df_speed["FunF"][df_speed.DecisionF==0],df_speed["AttractiveF"][df_speed.DecisionF==0],
           marker = 'o', color = 'blue', label = 'No')
plt.xlabel('FunF',fontsize=16)
plt.ylabel('AttractiveF',fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.legend(fontsize=16)
plt.show()

[Figure: pyplot scatter of FunF vs AttractiveF colored by DecisionF]

Classification (with Python)

Binary Classification

We first take DecisionF as the response, which gives a binary classification problem. As predictors we consider

{‘AgeM’,‘AttractiveF’,‘SincereF’,‘IntelligentF’,‘FunF’,‘AmbitiousF’}

that is, the man's age and the woman's ratings of his attractiveness, sincerity, intelligence, fun, and ambition. We split the data into training and test sets:

# split data
x, y = df_speed[['AgeM','AttractiveF','SincereF','IntelligentF','FunF','AmbitiousF']], df_speed["DecisionF"]
x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=0.2, random_state=0)

Check for missing values:

# check missing values
print("missing counts for x_train:")
print(x_train.apply(lambda x: sum(x.isnull()),axis=0))
print("missing counts for y_train:", sum(y_train.isnull()))
print("missing counts for x_test:")
print(x_test.apply(lambda x: sum(x.isnull()),axis=0))
print("missing counts for y_test:", sum(y_test.isnull()))

[Figure: missing-value counts for x_train, y_train, x_test, y_test]
We can see that y has no missing values, but the predictor covariates have a small number of missing values in both the training and test sets. Common ways of handling missing data include:

  • Use only the complete cases for analysis and modeling.
  • Impute the missing values by regression or other methods (e.g. scikit-learn's SimpleImputer; see the sketch after this list).
  • Model-based approaches, solving the optimization problem with the EM algorithm.
  • Build the missingness into the classifier itself.
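As an aside (not used in the analysis below), a minimal sketch of the imputation option using scikit-learn's SimpleImputer; the median strategy here mirrors what we do manually in the next step:

from sklearn.impute import SimpleImputer

# Fit the imputer on the training covariates only, then apply it to both sets,
# so that no information from the test set leaks into the imputation.
imputer = SimpleImputer(strategy='median')
x_train_imp = imputer.fit_transform(x_train)
x_test_imp = imputer.transform(x_test)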

Here we take a simple approach: fill the missing values in each column with that column's median:

# fill missing values with medians of columns
x_train_mdfill = x_train.fillna(x_train.median())
x_test_mdfill = x_test.fillna(x_test.median())

We build our first model, model1: an ordinary logistic regression without a penalty term. Since the dataset is small, my guess is that the unregularized model will perform better.

# fit logistic model (no penalty; the C argument is ignored when penalty = 'none')
model1 = LogisticRegression(penalty = 'none', solver='newton-cg', C=0.05, multi_class='auto',random_state=0)
model1.fit(x_train_mdfill, y_train)

With the model trained, let us check the training and test accuracy:

# model evaluation
print("Accuracy of train data:", round(model1.score(x_train_mdfill, y_train),4)*100,"%")
print("Accuracy of test data:", round(model1.score(x_test_mdfill, y_test),4)*100,"%")

Accuracy of train data: 71.82 %
Accuracy of test data: 75.0 %

The accuracy is mediocre, which is to be expected. Next, compute the confusion matrix to examine the false positives and false negatives:

# predict y_test and calculate confusion matrix
y_pred = model1.predict(x_test_mdfill)
cm1 = confusion_matrix(y_test, y_pred)
print(cm1)

[[20  5]
 [ 9 22]]

Let us plot it. First define a helper function that draws a confusion matrix as a heatmap:

# define cm-heatmap function
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Oranges, font_size = 5):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=font_size)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(cm.shape[1])
    plt.xticks(tick_marks, fontsize=font_size)
    ax = plt.gca()
    # ax.set_xticklabels((ax.get_xticks() +1).astype(str))
    plt.yticks(tick_marks, fontsize=font_size)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black", fontsize=font_size)

    plt.tight_layout()
    plt.ylabel('True label', fontsize=font_size)
    plt.xlabel('Predicted label', fontsize=font_size)

Visualize the confusion matrix:

# cm heat plot
np.set_printoptions(precision=1)
print('Confusion matrix, without normalization')
fig, ax = plt.subplots(figsize = (5,5))
plot_confusion_matrix(cm1, font_size = 15)

[Figure: heatmap of the confusion matrix for model1]
The 9 false negatives are a bit high. Let us see how an $L_1$ penalty does:

# model with l1 penalty
model2 = LogisticRegression(penalty = 'l1', solver='liblinear', C=0.2, multi_class='auto',random_state=0)
model2.fit(x_train_mdfill, y_train)
print("Accuracy of train data:", round(model2.score(x_train_mdfill, y_train),4)*100,"%")
print("Accuracy of test data:", round(model2.score(x_test_mdfill, y_test),4)*100,"%")
y_pred = model2.predict(x_test_mdfill)
cm2 = confusion_matrix(y_test, y_pred)
print(cm2)
print('Confusion matrix, without normalization')
fig, ax = plt.subplots(figsize = (5,5))
plot_confusion_matrix(cm2, font_size = 15)

[Figure: heatmap of the confusion matrix for model2]
Worse than without the penalty. Let us try $L_2$:

# model with l2 penalty
model3 = LogisticRegression(penalty = 'l2', solver='liblinear', C=0.2, multi_class='auto',random_state=0)
model3.fit(x_train_mdfill, y_train)
print("Accuracy of train data:", round(model3.score(x_train_mdfill, y_train),4)*100,"%")
print("Accuracy of test data:", round(model3.score(x_test_mdfill, y_test),4)*100,"%")
y_pred = model3.predict(x_test_mdfill)
cm3 = confusion_matrix(y_test, y_pred)
print(cm3)
print('Confusion matrix, without normalization')
fig, ax = plt.subplots(figsize = (5,5))
plot_confusion_matrix(cm3, font_size = 15)

[Figure: heatmap of the confusion matrix for model3]
Slightly better than $L_1$ and slightly worse than no penalty; overall, adding a penalty makes little difference here.

Multi-class Classification

On the same dataset we can consider a multi-class problem: take AttractiveF as the response and {'AgeM','SincereF','IntelligentF','FunF','AmbitiousF'} as predictors; in other words, we ask whether the man's age together with the woman's ratings of his sincerity, intelligence, fun, and ambition can predict how attractive she finds him. Of course, this problem is hard and arguably not even well posed: everyone has their own standards, and the self-reported Attractive scores are fairly arbitrary, so accurate prediction is unlikely. Since y also has missing values in this multi-class problem, we simply drop the incomplete observations from the train and test data and analyze the complete cases only:

# multi-class
# split data
df_speed_clt = df_speed[['AgeM','RaceM','SincereF','IntelligentF','FunF','AmbitiousF','AttractiveF']].dropna()
x, y = df_speed_clt[['AgeM','SincereF','IntelligentF','FunF','AmbitiousF']], df_speed_clt["AttractiveF"].astype('int64')
x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=0.2, random_state=0)
# fit logistic model
model4 = LogisticRegression(penalty = 'l1', solver='saga', C=0.05, multi_class='auto',random_state=0)
model4.fit(x_train, y_train)
# model evaluation
print("Accuracy of train data:", round(model4.score(x_train, y_train),4)*100,"%")
print("Accuracy of test data:", round(model4.score(x_test, y_test),4)*100,"%")
# predict y_test and calculate confusion matrix
y_pred = model4.predict(x_test)
cm4 = confusion_matrix(y_test, y_pred)
print(cm4)
print('Confusion matrix, without normalization')
fig, ax = plt.subplots(figsize = (5,5))
plot_confusion_matrix(cm4, font_size = 10)

[Figure: heatmap of the confusion matrix for model4]
The accuracy is poor, as expected, but most of the misclassifications fall near the diagonal.

Regression (with R)

One nice thing about R is that glm handles missing values for us automatically (by default it drops incomplete rows). Let us run the logistic regressions in R and interpret the results:

# load packages and data
library(dplyr)   # for the %>% pipe
spd_dat <- read.csv("SpeedDating.csv")
# fit logistic model
spd_dat %>% 
  glm(DecisionF ~ AgeM + RaceM + AttractiveF + SincereF + 
                IntelligentF + FunF + AmbitiousF + SharedInterestsF, 
                family = binomial, data = .) %>% summary()
spd_dat %>% 
  glm(DecisionM ~ AgeF + RaceF + AttractiveM + SincereM + 
                IntelligentM + FunM + AmbitiousM + SharedInterestsM, 
                family = binomial, data = .) %>% summary()
Call:
glm(formula = DecisionF ~ AgeM + RaceM + AttractiveF + SincereF + 
    IntelligentF + FunF + AmbitiousF + SharedInterestsF, family = binomial, 
    data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8303  -0.8842  -0.2961   0.9048   2.7008  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -5.183645   1.680642  -3.084 0.002040 ** 
AgeM             -0.014608   0.045496  -0.321 0.748138    
RaceMBlack        0.199424   0.852144   0.234 0.814965    
RaceMCaucasian    0.375810   0.401090   0.937 0.348774    
RaceMLatino      -0.956647   0.765242  -1.250 0.211254    
RaceMOther        0.802939   0.641082   1.252 0.210397    
AttractiveF       0.320312   0.105403   3.039 0.002374 ** 
SincereF          0.003909   0.127785   0.031 0.975594    
IntelligentF      0.010790   0.165924   0.065 0.948148    
FunF              0.266105   0.118125   2.253 0.024276 *  
AmbitiousF       -0.075996   0.121793  -0.624 0.532646    
SharedInterestsF  0.317134   0.090709   3.496 0.000472 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 325.11  on 235  degrees of freedom
Residual deviance: 243.73  on 224  degrees of freedom
  (40 observations deleted due to missingness)
AIC: 267.73

Number of Fisher Scoring iterations: 5
Call:
glm(formula = DecisionM ~ AgeF + RaceF + AttractiveM + SincereM + 
    IntelligentM + FunM + AmbitiousM + SharedInterestsM, family = binomial, 
    data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3710  -0.7227   0.2759   0.7742   1.8958  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.98483    1.64890  -1.204    0.229    
AgeF             -0.03119    0.04462  -0.699    0.485    
RaceFBlack        0.30439    0.73366   0.415    0.678    
RaceFCaucasian   -0.55078    0.40281  -1.367    0.172    
RaceFLatino       0.46291    0.69276   0.668    0.504    
RaceFOther       -0.68209    0.75886  -0.899    0.369    
AttractiveM       1.05743    0.16602   6.369  1.9e-10 ***
SincereM         -0.17080    0.14148  -1.207    0.227    
IntelligentM     -0.27150    0.17674  -1.536    0.124    
FunM             -0.05941    0.13753  -0.432    0.666    
AmbitiousM       -0.13971    0.13522  -1.033    0.302    
SharedInterestsM  0.17794    0.09750   1.825    0.068 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 320.52  on 231  degrees of freedom
Residual deviance: 226.34  on 220  degrees of freedom
  (44 observations deleted due to missingness)
AIC: 250.34

Number of Fisher Scoring iterations: 5

From the regression results we can see that, for the women, the man's attractiveness, how fun he is, and sharing interests with him have positive and significant effects on her decision to date again. For the men, only the woman's attractiveness has a significant positive effect.

References

1. Morgan et al. (2014). Taking a chance in the classroom: Speed dating: Exploring initial romantic attraction.
