Chapter 5 Logistic Regression
Reading notes on *Speech and Language Processing*, 3rd edition
Generative and Discriminative Classifiers: The most important difference between naive Bayes and logistic regression is that logistic regression is a discriminative classifier while naive Bayes is a generative classifier.
A generative model like naive Bayes makes use of a likelihood term, which expresses how to generate the features of a document if we knew it was of class $c$. By contrast, a discriminative model in this text categorization scenario attempts to directly compute $P(c|d)$.
A probabilistic machine learning system for classification has four components:
- A feature representation of the input. For each input observation $x^{(i)}$, this will be a vector of features $[x_1, x_2, \ldots, x_n]$. We will generally refer to feature $i$ for input $x^{(j)}$ as $x_i^{(j)}$, sometimes simplified as $x_i$, but we will also see the notation $f_i$, $f_i(x)$, or, for multiclass classification, $f_i(c,x)$.
- A classification function that computes $\hat y$, the estimated class, via $p(y|x)$. In the next section we will introduce the sigmoid and softmax tools for classification.
- An objective function for learning, usually involving minimizing error on training examples. We will introduce the cross-entropy loss function.
- An algorithm for optimizing the objective function. We introduce the stochastic gradient descent algorithm.
Logistic regression has two phases:
training: we train the system (specifically the weights $w$ and $b$) using stochastic gradient descent and the cross-entropy loss.
test: given a test example $x$ we compute $p(y|x)$ and return the higher-probability label, $y = 1$ or $y = 0$.
5.1 Classification: the sigmoid
Logistic regression learns, from a training set, a vector of weights $w$ and a bias term $b$.
To make a decision on a test instance $x$:
$$z = w \cdot x + b$$
Since weights are real-valued, the output might even be negative; $z$ ranges from $-\infty$ to $\infty$. To create a probability $P(+|x)$ that $x$ is classified as positive, we'll pass $z$ through the sigmoid function $\sigma(z)$. During training, we will use an objective function to characterize the error between the estimated value $\hat y$ and the gold label $y$. In logistic regression, $\sigma(z)$ also serves as the estimated value. Since $\hat y$ is a real number between 0 and 1, we map the gold labels + and − to the real numbers 1 and 0:
$$P(+|x) = P(Y=1|x) = \hat y = \sigma(z) = \frac{1}{1+e^{-z}}$$
$$P(-|x) = P(Y=0|x) = 1 - \hat y = 1 - \sigma(z) = 1 - \sigma(w \cdot x + b)$$
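To make this concrete, here is a minimal numpy sketch of the decision rule; the weight, bias, and feature values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector (illustration only)
w = np.array([2.5, -5.0, 0.5])
b = 0.1
x = np.array([3.0, 2.0, 1.0])

z = w.dot(x) + b        # z = w . x + b
p_pos = sigmoid(z)      # P(+|x) = sigma(z)
p_neg = 1.0 - p_pos     # P(-|x) = 1 - sigma(z)
print("positive" if p_pos > 0.5 else "negative", p_pos, p_neg)
```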
5.1.1 Example: sentiment classification
Designing features: Features are generally designed by examining the training set with an eye to linguistic intuitions and the linguistic literature on the domain. A careful error analysis on the training or dev set of an early version of a system often provides insights into features.
- combination features or feature interactions
- feature templates
- representation learning: ways to learn features automatically in an unsupervised way from the input.
Choosing a classifier: Logistic regression has a number of advantages over naive Bayes. Naive Bayes makes overly strong conditional independence assumptions. Consider two features which are strongly correlated; in fact, imagine that we just add the same feature $f_1$ twice. Naive Bayes treats the two copies as independent and multiplies in their evidence twice, overestimating it, while logistic regression can compensate by simply splitting the weight between the correlated features.
When there are many correlated features, logistic regression will assign a more accurate probability than naive Bayes. So logistic regression generally works better on larger documents or datasets and is a common default. Despite the less accurate probabilities, naive Bayes still often makes the correct classification decision. Furthermore, naive Bayes works extremely well (even better than logistic regression) on very small datasets (Ng and Jordan, 2002) or short documents (Wang and Manning, 2012). Naive Bayes is also easy to implement and very fast to train (there's no optimization step), so it's still a reasonable approach to use in some situations.
5.2 Learning in Logistic Regression
- loss function (or cost function): the cross-entropy loss
- optimization algorithm: stochastic gradient descent
5.3 The cross-entropy loss function
$$L(\hat y, y) = \textrm{How much } \hat y \textrm{ differs from the true } y$$
$$L_{\textrm{MSE}}(\hat y, y) = \frac{1}{2}(\hat y - y)^2$$
It turns out that this MSE loss, which is very useful for some algorithms like linear regression, becomes harder to optimize (technically, non-convex) when applied to probabilistic classification.
Instead, we use a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters $w, b$ that maximize the log probability of the true $y$ labels in the training data given the observations $x$. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
Let's derive this loss function, applied to a single observation $x$. We'd like to learn weights that maximize the probability of the correct label $p(y|x)$. Since there are only two discrete outcomes (1 or 0), this is a Bernoulli distribution, and we can express the probability $p(y|x)$ that our classifier produces for one observation by combining Eq. 2 and Eq. 3:
$$p(y|x) = \hat y^{\,y}(1-\hat y)^{1-y}$$
To make differentiation easier, we take the log of both sides:
$$\log p(y|x) = \log\left[\hat y^{\,y}(1-\hat y)^{1-y}\right] = y\log\hat y + (1-y)\log(1-\hat y)$$
In order to turn this into a loss function (something that we need to minimize), we'll just flip the sign. The result is the cross-entropy loss $L_{CE}$:
$$L_{CE} = -\log p(y|x) = -[y\log\hat y + (1-y)\log(1-\hat y)] = -[y\log\sigma(w\cdot x+b) + (1-y)\log(1-\sigma(w\cdot x+b))]$$
Why does minimizing this negative log probability do what we want? A perfect classifier would assign probability 1 to the correct outcome ($y=1$ or $y=0$) and probability 0 to the incorrect outcome. That means the higher the probability our classifier assigns to the correct label (the closer it is to 1), the better the classifier; the lower that probability (the closer it is to 0), the worse the classifier. The negative log of this probability is a convenient loss metric since it goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss). This loss function also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized; since the two sum to one, any increase in the probability of the correct answer comes at the expense of the incorrect answer. It's called the cross-entropy loss because it is also the formula for the cross-entropy between the true distribution $y$ and our estimated distribution $\hat y$.
Let's now extend it from one example to the whole training set: we'll continue to use the notation that $x^{(i)}$ and $y^{(i)}$ mean the $i$th training features and training label, respectively. We make the assumption that the training examples are independent:
$$\log p(\textrm{training labels}) = \log\prod_{i=1}^m p(y^{(i)}|x^{(i)}) = \sum_{i=1}^m \log p(y^{(i)}|x^{(i)}) = -\sum_{i=1}^m L_{CE}(\hat y^{(i)}, y^{(i)})$$
We’ll define the cost function for the whole dataset as the average loss for each example:
$$Cost(w,b) = \frac{1}{m}\sum_{i=1}^m L_{CE}(\hat y^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log\sigma(w\cdot x^{(i)}+b) + (1-y^{(i)})\log\left(1-\sigma(w\cdot x^{(i)}+b)\right)\right]$$
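As a sanity check on this formula, here is a minimal vectorized numpy sketch of the cost; the helper name and the clipping step are my own additions:

```python
import numpy as np

def cross_entropy_cost(w, b, X, y):
    """Average cross-entropy loss over m examples.
    X: (m, n) feature matrix; y: (m,) array of 0/1 labels."""
    y_hat = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))   # sigma(w . x + b), per example
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)        # avoid log(0) on saturated predictions
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```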
5.4 Gradient Descent
Our goal with gradient descent is to find the optimal weights: minimize the loss function we've defined for the model. We'll explicitly represent the fact that the loss function $L$ is parameterized by the weights, which we'll refer to in machine learning in general as $\theta$ (in the case of logistic regression $\theta = (w, b)$):
$$\hat\theta = \mathop{\textrm{argmin}}_\theta \frac{1}{m}\sum_{i=1}^m L_{CE}(y^{(i)}, x^{(i)}; \theta)$$
In each dimension $w_i$ (plus the bias $b$), we express the slope as a partial derivative $\frac{\partial}{\partial w_i}$ of the loss function. The gradient is then defined as a vector of these partials. We'll represent $\hat y$ as $f(x;\theta)$ to make the dependence on $\theta$ more obvious:
$$\nabla_\theta L(f(x;\theta), y) = \begin{bmatrix} \frac{\partial}{\partial w_1}L(f(x;\theta),y) \\ \frac{\partial}{\partial w_2}L(f(x;\theta),y) \\ \vdots \\ \frac{\partial}{\partial w_n}L(f(x;\theta),y) \end{bmatrix}$$
The final equation for updating $\theta$ based on the gradient is thus
$$\theta_{t+1} = \theta_t - \eta\nabla_\theta L(f(x;\theta), y)$$
5.4.1 The Gradient for Logistic Regression
$$\frac{\partial L_{CE}(w,b)}{\partial w_j} = [\sigma(w\cdot x+b) - y]x_j$$
We will also make use of the indicator function $1\{\}$, which evaluates to 1 if the condition in the braces is true and to 0 otherwise.
$$\frac{\partial Cost(w,b)}{\partial w_j} = \sum_{i=1}^m \left[\sigma(w\cdot x^{(i)}+b) - y^{(i)}\right]x_j^{(i)}$$
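In vectorized form this batch gradient is a single matrix product; a minimal sketch, with a function name of my own:

```python
import numpy as np

def batch_gradients(w, b, X, y):
    """Gradient of the summed cost over all m examples.
    Returns (dCost/dw, dCost/db)."""
    errors = 1.0 / (1.0 + np.exp(-(X.dot(w) + b))) - y   # sigma(w . x + b) - y, per example
    return X.T.dot(errors), errors.sum()
```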
5.4.2 The Stochastic Gradient Descent Algorithm
```python
import numpy as np

n_epochs = 50
t0, t1 = 5, 50           # learning-schedule hyperparameters
n_features = 10
m = 100                  # number of training examples

# Placeholder training data; replace with a real feature matrix and labels
X_train = np.random.randn(m, n_features)
y = np.random.randn(m, 1)

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.rand(n_features, 1)

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        # Slicing (rather than plain indexing) keeps xi two-dimensional with
        # shape (1, n_features), so xi.T has shape (n_features, 1)
        xi = X_train[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        # Per-example gradient of the squared error (a linear-regression demo)
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```
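Note that the gradient in the snippet above is the mean-squared-error one (it is a generic SGD demo on a linear-regression loss). For logistic regression we would instead plug in the cross-entropy gradient $(\sigma(w\cdot x+b)-y)x$ from Section 5.4.1; a minimal sketch, with function names of my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, n_epochs=50, eta=0.1):
    """SGD for binary logistic regression with the cross-entropy loss.
    X: (m, n) feature matrix; y: (m,) array of 0/1 labels."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for epoch in range(n_epochs):
        for i in np.random.permutation(m):
            error = sigmoid(X[i].dot(w) + b) - y[i]   # sigma(w . x + b) - y
            w -= eta * error * X[i]                   # dL/dw_j = error * x_j
            b -= eta * error                          # dL/db   = error
    return w, b
```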
Stochastic gradient descent is called stochastic because it chooses a single random example at a time, moving the weights so as to improve performance on that single example. That can result in very choppy movements, so it’s also common to do minibatch gradient descent, which computes the gradient over batches of training instances rather than a single instance.
The learning rate $\eta$ is a hyperparameter that must be adjusted. If it's too high, the learner will take steps that are too large, overshooting the minimum of the loss function. If it's too low, the learner will take steps that are too small, and take too long to get to the minimum. It is most common to begin the learning rate at a higher value and then slowly decrease it, so that it is a function of the iteration $k$ of training; you will sometimes see the notation $\eta_k$ to mean the value of the learning rate at iteration $k$.
5.4.3 Working through an example
$$\frac{\partial L_{CE}(w,b)}{\partial b} = \sigma(w\cdot x+b) - y$$
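Putting this together with the weight gradient above, here is a sketch of a single SGD update on a toy example; the feature values, label, and learning rate are chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.0, 2.0])            # two made-up feature values
y = 1.0                             # true label: positive
w, b, eta = np.zeros(2), 0.0, 0.1   # weights start at zero

error = sigmoid(w.dot(x) + b) - y   # sigma(0) - 1 = -0.5
w = w - eta * error * x             # w becomes [0.15, 0.10]
b = b - eta * error                 # b becomes 0.05
print(w, b)
```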
5.5 Regularization
To avoid overfitting, a regularization term is added to the objective function, resulting in the following objective:
$$\hat w = \mathop{\textrm{argmax}}_w \left[\sum_{i=1}^m \log p(y^{(i)}|x^{(i)}) - \alpha R(w)\right]$$
The new component, $R(w)$, is called a regularization term, and is used to penalize large weights. Thus a setting of the weights that matches the training data perfectly, but uses many weights with high values to do so, will be penalized more than a setting that matches the data a little less well, but does so using smaller weights.
There are two common regularization terms $R(w)$. L2 regularization is a quadratic function of the weight values, named because it uses the (square of the) L2 norm of the weight values. The L2 norm, $||w||_2$, is the same as the Euclidean distance of the vector from the origin:
$$R(w) = ||w||_2^2 = \sum_{j=1}^N w_j^2$$
The L2 regularized objective function becomes:
$$\hat w = \mathop{\textrm{argmax}}_w \left[\sum_{i=1}^m \log P(y^{(i)}|x^{(i)}) - \alpha\sum_{j=1}^N w_j^2\right]$$
L1 regularization is a linear function of the weight values, named after the L1 norm $||w||_1$, the sum of the absolute values of the weights, or Manhattan distance (the Manhattan distance is the distance you'd have to walk between two points in a city with a street grid like New York):
$$R(w) = ||w||_1 = \sum_{j=1}^N |w_j|$$
The L1 regularized objective function becomes:
$$\hat w = \mathop{\textrm{argmax}}_w \left[\sum_{i=1}^m \log p(y^{(i)}|x^{(i)}) - \alpha\sum_{j=1}^N |w_j|\right]$$
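To make the two penalties concrete, here is a minimal numpy sketch of the (negated) regularized objective, whose minimization is equivalent to the argmax objectives above; the function name and interface are my own:

```python
import numpy as np

def regularized_cost(w, b, X, y, alpha, penalty="l2"):
    """Negative log likelihood of the data plus an L1 or L2 penalty."""
    y_hat = np.clip(1.0 / (1.0 + np.exp(-(X.dot(w) + b))), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    reg = np.sum(w ** 2) if penalty == "l2" else np.sum(np.abs(w))
    return nll + alpha * reg
```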
These kinds of regularization come from statistics, where L1 regularization is called the lasso or lasso regression (Tibshirani, 1996) and L2 regularization is called ridge regression, and both are commonly used in language processing. L2 regularization is easier to optimize because of its simple derivative (the derivative of $w^2$ is just $2w$), while L1 regularization is more complex (the derivative of $|w|$ is not continuous at zero). But where L2 prefers weight vectors with many small weights, L1 prefers sparse solutions with some larger weights but many more weights set to zero. Thus L1 regularization leads to much sparser weight vectors, that is, far fewer features.
Both L1 and L2 regularization have Bayesian interpretations as constraints on the prior of how weights should look. L1 regularization can be viewed as a Laplace prior on the weights. L2 regularization corresponds to assuming that weights are distributed according to a gaussian distribution with mean $\mu = 0$. In a gaussian or normal distribution, the further away a value is from the mean, the lower its probability (scaled by the variance $\sigma$). By using a gaussian prior on the weights, we are saying that weights prefer to have the value 0. A gaussian for a weight $w_j$ is
$$\frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\left(-\frac{(w_j-\mu_j)^2}{2\sigma_j^2}\right)$$
If we multiply each weight by a gaussian prior on the weight, we are thus maximizing the following constraint:
$$\hat w = \mathop{\textrm{argmax}}_w \left[\prod_{i=1}^m p(y^{(i)}|x^{(i)}) \times \prod_{j=1}^N \frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\left(-\frac{(w_j-\mu_j)^2}{2\sigma_j^2}\right)\right]$$
which in log space, with $\mu = 0$, and assuming $2\sigma^2 = 1$, corresponds to
$$\hat w = \mathop{\textrm{argmax}}_w \left[\sum_{i=1}^m \log p(y^{(i)}|x^{(i)}) - \alpha\sum_{j=1}^N w_j^2\right]$$
5.6 Multinomial logistic regression
Multinomial logistic regression is also called softmax regression (or, historically, the maxent classifier). In multinomial logistic regression the target $y$ is a variable that ranges over more than two classes; we want to know the probability of $y$ being in each potential class $c \in C$, $p(y = c|x)$.
The multinomial logistic classifier uses a generalization of the sigmoid, called the softmax function, to compute the probability $p(y = c|x)$. The softmax function takes a vector $z = [z_1, z_2, \ldots, z_k]$ of $k$ arbitrary values and maps them to a probability distribution, with each value in the range (0, 1] and all the values summing to 1. Like the sigmoid, it is an exponential function.
$$\textrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}, \quad 1 \le i \le k$$
The softmax of an input vector $z = [z_1, z_2, \ldots, z_k]$ is thus a vector itself:
$$\textrm{softmax}(z) = \left[\frac{e^{z_1}}{\sum_{j=1}^k e^{z_j}}, \frac{e^{z_2}}{\sum_{j=1}^k e^{z_j}}, \ldots, \frac{e^{z_k}}{\sum_{j=1}^k e^{z_j}}\right]$$
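A minimal numpy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result (the example scores are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])  # arbitrary example scores
print(softmax(z), softmax(z).sum())             # a distribution summing to 1
```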
The probability for each class $c$ is
$$p(y=c|x) = \frac{e^{w_c\cdot x+b_c}}{\sum_{j=1}^k e^{w_j\cdot x+b_j}}$$
5.6.1 Features in Multinomial Logistic Regression
For multiclass classification the input features need to be a function of both the observation $x$ and the candidate output class $c$. We will use the notation $f_i(c,x)$, meaning feature $i$ for a particular class $c$ for a given observation $x$.
5.6.2 Learning in Multinomial Logistic Regression
The loss function for a single example $x$ is the sum of the logs of the $K$ output classes:
$$L_{CE}(\hat y, y) = -\sum_{k=1}^K 1\{y=k\}\log p(y=k|x) = -\sum_{k=1}^K 1\{y=k\}\log\frac{e^{w_k\cdot x+b_k}}{\sum_{j=1}^K e^{w_j\cdot x+b_j}}$$
$$\frac{\partial L_{CE}}{\partial w_k} = -(1\{y=k\} - p(y=k|x))x_k = -\left(1\{y=k\} - \frac{e^{w_k\cdot x+b_k}}{\sum_{j=1}^K e^{w_j\cdot x+b_j}}\right)x_k$$
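As a code sketch of this gradient, using a one-hot encoding of the true class (the function names and matrix layout are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multinomial_ce_gradients(W, b, x, y):
    """Gradients of the multinomial cross-entropy loss for one example.
    W: (K, n) weights; b: (K,) biases; x: (n,) features; y: true class index."""
    p = softmax(W.dot(x) + b)          # p(y = k | x) for every class k
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    error = p - one_hot                # equals -(1{y=k} - p(y=k|x)) for each k
    return np.outer(error, x), error   # gradients w.r.t. W and b
```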
5.7 Interpreting models
We want our decisions to be interpretable. Because the features of logistic regression are often human-designed, one way to understand a classifier's decision is to understand the role each feature plays in it. Logistic regression can be combined with statistical tests (the likelihood ratio test, or the Wald test); investigating whether a particular feature is significant by one of these tests, or inspecting its magnitude (how large is the weight $w$ associated with the feature?), can help us interpret why the classifier made the decision it did. This is enormously important for building transparent models.
Furthermore, in addition to its use as a classifier, logistic regression in NLP and many other fields is widely used as an analytic tool for testing hypotheses about the effect of various explanatory variables (features). In such cases, logistic regression allows us to test whether some feature is associated with some outcome above and beyond the effect of other features.
5.8 Advanced: Deriving the Gradient Equation
5.9 Summary
This chapter introduced the logistic regression model of classification.
- Logistic regression is a supervised machine learning classifier that extracts real-valued features from the input, multiplies each by a weight, sums them, and passes the sum through a sigmoid function to generate a probability. A threshold is used to make a decision.
- Logistic regression can be used with two classes (e.g., positive and negative sentiment) or with multiple classes (multinomial logistic regression, for example for n-ary text classification, part-of-speech labeling, etc.).
- Multinomial logistic regression uses the softmax function to compute probabilities. The weights (vector $w$ and bias $b$) are learned from a labeled training set via a loss function, such as the cross-entropy loss, that must be minimized.
- Minimizing this loss function is a convex optimization problem, and iterative algorithms like gradient descent are used to find the optimal weights.
- Regularization is used to avoid overfitting.
- Logistic regression is also one of the most useful analytic tools, because of its ability to transparently study the importance of individual features.