Naive Bayes
Naive Bayes is a classification technique based on Bayes' theorem, with the assumption that predictors are conditionally independent given the class.
P(c|x) = \frac{P(x|c)P(c)}{P(x)}

where c is the target class and x is the attribute vector.
P(y|x_1,\cdots,x_n) = \frac{P(x_1|y)\cdots P(x_n|y)P(y)}{P(x_1)P(x_2)\cdots P(x_n)}
P(y|x_1,\cdots,x_n) \propto P(y)\prod^n_{i=1}P(x_i|y)

\hat{y} = \argmax_y P(y)\prod^n_{i=1}P(x_i|y)
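As a minimal sketch of this decision rule (function names here are illustrative, not from any library), a categorical Naive Bayes with Laplace smoothing, computed in log space to avoid underflow:

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    # X: (n_samples, n_features) integer-coded categorical features
    # y: (n_samples,) integer class labels; alpha: Laplace smoothing
    classes = np.unique(y)
    n_values = X.max(axis=0) + 1                 # categories per feature
    log_prior, log_lik = {}, {}
    for c in classes:
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))  # log P(y=c)
        log_lik[c] = [
            np.log((np.bincount(Xc[:, j], minlength=n_values[j]) + alpha)
                   / (len(Xc) + alpha * n_values[j]))   # log P(x_j=v | y=c)
            for j in range(X.shape[1])
        ]
    return classes, log_prior, log_lik

def predict_nb(x, classes, log_prior, log_lik):
    # y_hat = argmax_y  log P(y) + sum_j log P(x_j | y)
    scores = {c: log_prior[c] + sum(log_lik[c][j][x[j]] for j in range(len(x)))
              for c in classes}
    return max(scores, key=scores.get)
```

Working in log space turns the product over features into a sum, which is both faster and numerically stable.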
AutoEncoder
Basic architecture: an autoencoder has two main parts, an encoder \phi that maps the input into the code, and a decoder \psi that maps the code back to a reconstruction of the input.
\phi : X \to F

\psi : F \to X

\phi,\psi = \argmin_{\phi,\psi} ||X-(\psi\circ\phi)X||^2
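A minimal PyTorch sketch of this objective (layer sizes and dimensions are placeholder assumptions): the encoder plays the role of \phi, the decoder of \psi, and training minimizes the squared reconstruction error.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # phi: X -> F (input to code)
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # psi: F -> X (code to reconstruction)
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))      # (psi ∘ phi)(x)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                          # dummy batch
loss = nn.functional.mse_loss(model(x), x)        # ||x - (psi∘phi)x||^2
opt.zero_grad(); loss.backward(); opt.step()
```

Because the code dimension is smaller than the input dimension, the network is forced to learn a compressed representation rather than the identity map.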
Knowledge Distillation
Train a large model, then serve a smaller one in production. For a multi-class task, the smaller (student) model is trained to match the softmax output of the bigger (teacher) model.
There are many tricks for knowledge distillation, such as tuning the softmax temperature.
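A sketch of the standard soft-target loss with temperature (following Hinton et al.'s 2015 formulation; the T and alpha values are illustrative): the student matches the teacher's temperature-softened distribution, blended with the usual hard-label cross-entropy.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft term: KL divergence between the teacher's and student's
    # temperature-softened distributions. The T**2 factor keeps its
    # gradient magnitude comparable to the hard term.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T ** 2)
    hard = F.cross_entropy(student_logits, labels)  # standard hard-label loss
    return alpha * soft + (1 - alpha) * hard
```

A higher temperature flattens the teacher's softmax, exposing the relative probabilities of the wrong classes ("dark knowledge") that a one-hot label discards.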