Machine learning system design
Prioritizing what to work on: spam classification example
How should you spend your time to make the classifier have low error?
- Collect lots of data
- Develop sophisticated features based on email routing information (from the email header).
- Develop sophisticated features for the message body, e.g. should "discount" and "discounts" be treated as the same word?
- Develop sophisticated algorithms to detect misspellings (e.g. "m0rtgage").
Error analysis
**Recommended approach**
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide whether more data, more features, etc. are likely to help.
- Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
- For example: manually examine 100 misclassified emails, and categorize them based on their type and on what features you think would have helped the algorithm classify them correctly.
The importance of numerical evaluation
- Should "discount"/"discounts"/"discounted"/"discounting" be treated as the same word?
- Error analysis may not be helpful for deciding whether this is likely to improve performance. The only solution is to try it and see whether it works, using a single numerical evaluation (e.g. cross-validation error).
Error metrics for skewed classes
Skewed classes
If you have very skewed classes, it becomes much harder to use classification accuracy alone: you can get very high accuracy (or very low error) without it being clear that you have really improved the quality of the classifier, because predicting y = 0 all the time doesn't seem like a particularly good classifier.
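A quick numeric sketch (with a made-up dataset) of why raw accuracy is misleading here:

```python
# Sketch: why accuracy misleads on skewed classes.
# Hypothetical dataset: 1000 examples, only 5 positives (0.5% positive rate).
y_true = [1] * 5 + [0] * 995

# A "classifier" that ignores its input and always predicts y = 0.
predictions = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.995 -- 99.5% accuracy, yet it detects no positives at all
```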
Precision/Recall
$y=1$ in presence of the rare class that we want to detect.

| | Actual class 1 | Actual class 0 |
| --- | --- | --- |
| Predicted 1 | True positive | False positive |
| Predicted 0 | False negative | True negative |
Precision (of all patients where we predicted $y=1$, what fraction actually has cancer?):

$$\text{Precision} = \frac{\text{true positives}}{\text{predicted positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
Recall (of all patients that actually have cancer, what fraction did we correctly detect as having cancer?). A high recall would be a good thing.

$$\text{Recall} = \frac{\text{true positives}}{\text{actual positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
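The two definitions above can be sketched directly in code; the labels below are made up for illustration:

```python
# Sketch: computing precision and recall from raw predictions.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many we caught
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]  # made-up predictions
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```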
Trading off precision and recall
How to compare precision/recall numbers?
To compare, calculate the $F_1$ score:

$$F_1 = \frac{2PR}{P+R}$$

$F_1 = 0$ if $P = 0$ or $R = 0$; $F_1 = 1$ only when $P = 1$ and $R = 1$ (perfect precision and recall).
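A small sketch of the $F_1$ formula, showing why it punishes an imbalanced precision/recall pair (the input values are illustrative):

```python
# Sketch: F1 is the harmonic mean of precision and recall; it is high
# only when BOTH are high, unlike a simple average.
def f1_score(p, r):
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

print(f1_score(0.5, 0.4))   # ~0.444: balanced values -> moderate score
print(f1_score(0.02, 1.0))  # ~0.039: "predict 1 always" gets recall 1 but scores near 0
print(f1_score(1.0, 1.0))   # 1.0: perfect precision AND recall
```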
Data for machine learning
Large data rationale
- Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; a neural network with many hidden units).
- Use a very large training set (unlikely to overfit).
Support Vector Machines (SVM)
Optimization objective
The SVM sometimes gives a cleaner, and sometimes a more powerful, way of learning complex nonlinear functions.
Starting from the logistic regression hypothesis:

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$$
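A minimal sketch of this hypothesis, with illustrative (made-up) values for $\theta$ and $x$:

```python
import math

# Sketch of the logistic hypothesis h_theta(x) = 1 / (1 + e^{-theta^T x}).
def h(theta, x):
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))  # theta^T x
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.5, -1.0]            # illustrative parameters
print(h(theta, [10.0, 0.0]))   # theta^T x >> 0 -> output close to 1
print(h(theta, [0.0, 10.0]))   # theta^T x << 0 -> output close to 0
```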
If $y=1$, we want $h_\theta(x)\approx 1$, i.e. $\theta^T x \gg 0$.

If $y=0$, we want $h_\theta(x)\approx 0$, i.e. $\theta^T x \ll 0$.
SVM hypothesis
$$\min_\theta\; C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$
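The objective can be sketched in code. The $\mathrm{cost}_1$/$\mathrm{cost}_0$ terms below are hinge-style piecewise-linear surrogates; the unit slope is an assumption (the notes do not pin it down), and for simplicity the sketch regularizes every $\theta_j$ rather than excluding $\theta_0$:

```python
# Hinge-style surrogate costs (assumed unit slope):
# cost_1(z) = 0 for z >= 1, growing linearly as z falls below 1;
# cost_0(z) = 0 for z <= -1, growing linearly as z rises above -1.
def cost_1(z):  # used when y = 1
    return max(0.0, 1.0 - z)

def cost_0(z):  # used when y = 0
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum of per-example costs + (1/2) * sum_j theta_j^2
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t_j * x_j for t_j, x_j in zip(theta, x_i))  # theta^T x^(i)
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t_j * t_j for t_j in theta)
    return C * data_term + reg_term

# Both made-up examples satisfy their margin, so only regularization remains.
theta = [1.0, 1.0]
X = [[2.0, 0.0], [-2.0, 0.0]]
y = [1, 0]
print(svm_objective(theta, X, y, C=1.0))  # 1.0
```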
Hypothesis: $h_\theta(x)=1$ if $\theta^T x \geq 0$, and $h_\theta(x)=0$ otherwise.
Large margin intuition
- margin: the distance between the decision boundary and the closest training examples
The mathematics behind large margin classification (optional)
- Vector inner products
- $\|u\|$ = length of vector $u$
- $p$ = length of the projection of $v$ onto $u$ (signed)
- $u^T v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$
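The identity $u^T v = p \cdot \|u\|$ can be checked numerically (the vectors here are made up):

```python
import math

# Sketch: u^T v equals p * ||u||, where p is the signed length of the
# projection of v onto u (2-D example).
u = [3.0, 4.0]
v = [2.0, 1.0]

inner = u[0] * v[0] + u[1] * v[1]          # u^T v = u1*v1 + u2*v2
norm_u = math.sqrt(u[0] ** 2 + u[1] ** 2)  # ||u|| = length of vector u
p = inner / norm_u                         # signed projection of v onto u

print(inner)       # 10.0
print(p * norm_u)  # 10.0 -- same value, as the identity states
```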
$$\min_\theta\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\sqrt{\theta_1^2+\theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$$

s.t.
- $\theta^T x^{(i)} \geq 1$ if $y^{(i)}=1$
- $\theta^T x^{(i)} \leq -1$ if $y^{(i)}=0$
Simplification: $\theta_0=0$, $n=2$
Looking at this optimization, we want the projections $p^{(i)}$ of the positive and negative examples onto $\theta$ to be large, and the only way for that to hold is for the decision boundary to keep a large margin around itself. This is why the support vector machine ends up being a large margin classifier: minimizing $\|\theta\|$ while satisfying the constraints forces the projections $p^{(i)}$, the distances from the training examples to the decision boundary, to be large.
If $\theta_0 = 0$, we are entertaining only decision boundaries that pass through the origin.
Kernels I
This is called a Gaussian kernel: a particular choice of similarity function.
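A sketch of the Gaussian kernel as a similarity function between a point $x$ and a landmark $l$, i.e. $f = \exp\!\left(-\frac{\|x-l\|^2}{2\sigma^2}\right)$ (the values below are illustrative):

```python
import math

# Sketch: Gaussian (RBF) kernel similarity between x and a landmark l.
def gaussian_kernel(x, l, sigma):
    sq_dist = sum((x_j - l_j) ** 2 for x_j, l_j in zip(x, l))  # ||x - l||^2
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

landmark = [3.0, 5.0]  # made-up landmark
print(gaussian_kernel([3.0, 5.0], landmark, sigma=1.0))  # 1.0: x at the landmark
print(gaussian_kernel([9.0, 9.0], landmark, sigma=1.0))  # ~0: x far from the landmark
```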
Kernels II
$C$ ($= \frac{1}{\lambda}$):
- Large $C$: lower bias, higher variance
- Small $C$: higher bias, lower variance

$\sigma^2$:
- Large $\sigma^2$: features $f_i$ vary more smoothly (higher bias, lower variance)
- Small $\sigma^2$: features $f_i$ vary less smoothly (lower bias, higher variance)
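The effect of $\sigma^2$ on how quickly a kernel feature $f_i$ falls off with distance can be seen numerically (the distance value is made up):

```python
import math

# Sketch: larger sigma^2 makes the Gaussian-kernel feature f decay more
# slowly with squared distance from a landmark, i.e. vary more smoothly.
def f(sq_dist, sigma_sq):
    return math.exp(-sq_dist / (2.0 * sigma_sq))

sq_dist = 4.0  # illustrative squared distance ||x - l||^2
print(f(sq_dist, sigma_sq=0.5))   # ~0.018: small sigma^2, similarity drops fast
print(f(sq_dist, sigma_sq=10.0))  # ~0.82:  large sigma^2, similarity decays slowly
```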
Mercer's Theorem
Note: not all similarity functions $\mathrm{similarity}(x, l)$ make valid kernels. A kernel needs to satisfy a technical condition called Mercer's Theorem to make sure SVM packages' optimizations run correctly and do not diverge.