# 1. Motivation

Softmax Regression is mainly used for multi-class classification: it maps a vector of scores, one per label, into a probability distribution over the labels.

(Figure: a model mapping N input features to K output labels.)
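A minimal sketch of this mapping in NumPy (illustrative only; `softmax` is a name introduced here, not part of the derivation below):

```python
import numpy as np

def softmax(z):
    """Map a vector of K scores to a probability distribution over K labels."""
    z = z - np.max(z)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p sums to 1 and preserves the ordering of the input scores
```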

# 2. Introduction

Recall binary logistic regression, where each label is

$$y^{(i)} \in \{0, 1\}$$

Given a training set

$$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$$

the hypothesis is

$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$$

so that

$$P(y = 0 \,|\, x; \theta) = \frac{1}{1 + \exp(\theta^T x)}$$

$$P(y = 1 \,|\, x; \theta) = \frac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}$$

The cost function (the negative log-likelihood) is

$$J(\theta) = -\left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and the parameters are chosen by minimizing it:

$$\theta = \arg\min_\theta J(\theta)$$
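The binary model above can be sketched as follows (a hypothetical NumPy example; the names `h` and `cost` are introduced here, not from the text):

```python
import numpy as np

def h(theta, x):
    """Logistic hypothesis h_theta(x) = P(y = 1 | x; theta)."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

def cost(theta, X, y):
    """J(theta): negative log-likelihood over the m training examples."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
J0 = cost(np.zeros(2), X, y)   # at theta = 0, p = 1/2 everywhere, so J = m * log(2)
```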

# 3. Equation

## 3.2. Probability Estimation

$$h_\theta(x) = \begin{bmatrix} P(y = 1 \,|\, x; \theta) \\ \vdots \\ P(y = K \,|\, x; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x)} \begin{bmatrix} \exp(\theta^{(1)T} x) \\ \vdots \\ \exp(\theta^{(K)T} x) \end{bmatrix}$$

where the parameters are collected column-wise:

$$\theta = \begin{bmatrix} | & | & & | \\ \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\ | & | & & | \end{bmatrix}$$
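The probability-estimation step can be sketched as follows (hypothetical NumPy helper; `predict_proba` and the `(n, K)` layout of `Theta` are assumptions introduced here):

```python
import numpy as np

def predict_proba(Theta, x):
    """h_theta(x): the K-vector of P(y = k | x; theta).

    Theta has shape (n, K); column k holds theta^{(k)}."""
    scores = Theta.T @ x              # theta^{(k)T} x for each k
    scores = scores - scores.max()    # stability: exp of large scores overflows
    e = np.exp(scores)
    return e / e.sum()

p = predict_proba(np.zeros((4, 3)), np.ones(4))
# with all-zero parameters every class is equally likely: p = [1/3, 1/3, 1/3]
```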

## 3.3. Cost Function

$$J(\theta) = -\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log P(y^{(i)} = k \,|\, x^{(i)}; \theta) \right] = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}$$

Gradient descent updates each $\theta^{(k)}$ as

$$\theta^{(k)} := \theta^{(k)} - \alpha \nabla_{\theta^{(k)}} J(\theta)$$

where

$$\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \,|\, x^{(i)}; \theta) \right) \right]$$
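Cost and gradient together, as a vectorized sketch (assumed NumPy layout: `Theta` is $n \times K$ with column $k$ equal to $\theta^{(k)}$; `cost_and_grad` is a name introduced here):

```python
import numpy as np

def cost_and_grad(Theta, X, y, K):
    """J(theta) and its gradient with respect to every theta^{(k)}.

    X: (m, n) design matrix; y: (m,) labels in {0, ..., K-1};
    Theta: (n, K), column k is theta^{(k)}."""
    scores = X @ Theta                            # (m, K): theta^{(k)T} x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)          # P[i, k] = P(y^{(i)} = k | x^{(i)})
    Y = np.eye(K)[y]                              # one-hot rows: 1{y^{(i)} = k}
    J = -np.sum(Y * np.log(P))
    grad = -X.T @ (Y - P)                         # column k = gradient wrt theta^{(k)}
    return J, grad

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
J, G = cost_and_grad(np.zeros((2, 2)), X, y, 2)
# at theta = 0 every class gets probability 1/K, so J = m * log(K) = 2 log 2
```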

**Algorithm 1.** Softmax Regression

(1) Initialize the parameters: for $k \in \{1, 2, \dots, K\}$, set $\theta^{(k)} = 0^{n \times 1}$.

(2) For $i = 1, \dots, \#\textrm{training}$:

for $k = 1, \dots, K$, update

$$\theta^{(k)} := \theta^{(k)} - \alpha \nabla_{\theta^{(k)}} J(\theta)$$

$$\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \,|\, x^{(i)}; \theta) \right) \right]$$

(3) If not every $\theta^{(k)}$ has converged, repeat step (2).
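Algorithm 1 can be sketched end-to-end as batch gradient descent (the step size `alpha` and the fixed iteration cap are assumptions made here; the text instead iterates until all $\theta^{(k)}$ converge):

```python
import numpy as np

def softmax_regression(X, y, K, alpha=0.1, iters=500):
    """Algorithm 1 as batch gradient descent (alpha is an assumed step size)."""
    m, n = X.shape
    Theta = np.zeros((n, K))                    # step (1): theta^{(k)} = 0^{n x 1}
    Y = np.eye(K)[y]                            # one-hot rows: 1{y^{(i)} = k}
    for _ in range(iters):                      # steps (2)-(3): iterate until done
        scores = X @ Theta
        scores = scores - scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P = P / P.sum(axis=1, keepdims=True)
        grad = -X.T @ (Y - P)                   # gradient from Section 3.3
        Theta = Theta - alpha * grad            # theta^{(k)} := theta^{(k)} - alpha * grad
    return Theta

# tiny linearly separable example
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = np.array([0, 1, 2])
Theta = softmax_regression(X, y, K=3)
pred = np.argmax(X @ Theta, axis=1)
```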

# 4. Reference

[1] 李航 (Li Hang). 统计学习方法 (Statistical Learning Methods), 2012.

[2] Andrew Ng, et al. Softmax Regression. UFLDL Tutorial. http://ufldl.stanford.edu/tutorial/

# 5. Appendix

## 5.1. Gradient of the Cost Function

$$J(\theta) = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}$$

Consider the contribution of a single example $i$:

$$L(\theta) = -\sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}$$

To differentiate with respect to $\theta^{(a)}$, split $L(\theta)$ into the terms with $k \ne a$ (call them $L_1$) and the term with $k = a$ (call it $L_2$):

$$L(\theta) = \underbrace{-\sum_{k \ne a} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}}_{L_1(\theta)} \; \underbrace{- \, 1\{y^{(i)} = a\} \log \frac{\exp(\theta^{(a)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}}_{L_2(\theta)}$$

For each term of $L_1$ (a fixed $k \ne a$; note that $\exp(\theta^{(k)T} x^{(i)})$ does not depend on $\theta^{(a)}$):

$$\begin{aligned}
\nabla_{\theta^{(a)}} L_1(\theta) &= -1\{y^{(i)} = k\} \, \nabla_{\theta^{(a)}} \log \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \\
&= -1\{y^{(i)} = k\} \, \frac{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}{\exp(\theta^{(k)T} x^{(i)})} \, \nabla_{\theta^{(a)}} \frac{\exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \\
&= -1\{y^{(i)} = k\} \, \frac{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}{\exp(\theta^{(k)T} x^{(i)})} \cdot \left( -\frac{\exp(\theta^{(k)T} x^{(i)}) \, x^{(i)} \exp(\theta^{(a)T} x^{(i)})}{\left( \sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)}) \right)^2} \right) \\
&= 1\{y^{(i)} = k\} \, \frac{\exp(\theta^{(a)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \, x^{(i)} \\
&= 1\{y^{(i)} = k\} \, x^{(i)} \, P(y^{(i)} = a \,|\, x^{(i)}; \theta)
\end{aligned}$$

For $L_2$ (the $k = a$ term), the quotient rule gives:

$$\begin{aligned}
\nabla_{\theta^{(a)}} L_2(\theta) &= -1\{y^{(i)} = a\} \, \nabla_{\theta^{(a)}} \log \frac{\exp(\theta^{(a)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \\
&= -1\{y^{(i)} = a\} \, \frac{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}{\exp(\theta^{(a)T} x^{(i)})} \, \nabla_{\theta^{(a)}} \frac{\exp(\theta^{(a)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \\
&= -1\{y^{(i)} = a\} \, \frac{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})}{\exp(\theta^{(a)T} x^{(i)})} \cdot \frac{x^{(i)} \exp(\theta^{(a)T} x^{(i)}) \sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)}) - x^{(i)} \exp(\theta^{(a)T} x^{(i)})^2}{\left( \sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)}) \right)^2} \\
&= -1\{y^{(i)} = a\} \left( x^{(i)} - \frac{x^{(i)} \exp(\theta^{(a)T} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)T} x^{(i)})} \right) \\
&= -1\{y^{(i)} = a\} \, x^{(i)} \left( 1 - P(y^{(i)} = a \,|\, x^{(i)}; \theta) \right)
\end{aligned}$$

In both cases ($y^{(i)} = a$ or $y^{(i)} \ne a$), the per-example gradient collapses to $-x^{(i)} \left( 1\{y^{(i)} = a\} - P(y^{(i)} = a \,|\, x^{(i)}; \theta) \right)$. Summing over all examples (and renaming $a$ to $k$) gives

$$\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \,|\, x^{(i)}; \theta) \right) \right]$$
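A quick way to sanity-check this closed-form gradient is to compare it against central finite differences (an illustrative sketch; all names and the random test data are introduced here):

```python
import numpy as np

def cost(Theta, X, y, K):
    """J(theta) = -sum_i sum_k 1{y_i = k} log P(y_i = k | x_i; theta)."""
    scores = X @ Theta
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)
    return -np.sum(np.eye(K)[y] * np.log(P))

def grad(Theta, X, y, K):
    """Closed form: column k is -sum_i x_i (1{y_i = k} - P(y_i = k | x_i; theta))."""
    scores = X @ Theta
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)
    return -X.T @ (np.eye(K)[y] - P)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 2, 1, 2, 0])
K = 3
Theta = rng.normal(size=(3, K))
G = grad(Theta, X, y, K)

# central finite differences over every entry of Theta
eps = 1e-6
G_num = np.zeros_like(Theta)
for idx in np.ndindex(*Theta.shape):
    Tp, Tm = Theta.copy(), Theta.copy()
    Tp[idx] += eps
    Tm[idx] -= eps
    G_num[idx] = (cost(Tp, X, y, K) - cost(Tm, X, y, K)) / (2 * eps)
```

The two gradients should agree to within the finite-difference error.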

2015.12.16, at Zhejiang University.