9.25 Machine Learning System Design & Support Vector Machines

Machine learning system design

Prioritizing what to work on: Spam classification example

How should you spend your time to make the classifier have low error?

  • Collect lots of data.
  • Develop sophisticated features based on email routing information (from the email header).
  • Develop sophisticated features for the message body, e.g. should “discount” and “discounts” be treated as the same word?
  • Develop a sophisticated algorithm to detect misspellings (e.g. m0rtgage).

Error analysis

**Recommended approach**

  • Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  • Plot learning curves to decide whether more data, more features, etc. are likely to help.
  • Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
  • For example, manually examine 100 errors and categorize them based on their type and on what features you think would have helped the algorithm classify them correctly (a sketch of this workflow follows below).
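A minimal sketch of this workflow in Python (the `model`, `X_cv`, and `y_cv` names are hypothetical stand-ins for your trained classifier and cross-validation set):

```python
import numpy as np

def misclassified_indices(model, X_cv, y_cv):
    """Return indices of cross-validation examples the classifier got wrong."""
    predictions = model.predict(X_cv)        # assumes a predict() method
    return np.where(predictions != y_cv)[0]

# After manually inspecting ~100 of these errors, tally them by hand-assigned
# category; the largest bucket suggests where feature work is likely to pay off:
error_counts = {"pharma spam": 12, "replica/fake goods": 4,
                "phishing (steal passwords)": 53, "other": 31}
```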

The importance of numerical evaluation

  • should discount/discounts/discounted/discounting be treated as the same word?
  • Error analysis may not be helpful for deciding whether this is likely to improve performance. The only solution is to try it and see if it works, as in the stemming sketch below.
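One way to try it: map each word to a common stem, then compare cross-validation error with and without that preprocessing. A small sketch using NLTK's PorterStemmer (the library choice is an assumption; the notes don't name one):

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed (pip install nltk)

stemmer = PorterStemmer()
for word in ["discount", "discounts", "discounted", "discounting"]:
    print(stemmer.stem(word))  # all four stem to "discount"

# Retrain with stemmed features, then keep stemming only if CV error drops.
```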

Error metrics for skewed classes

Skewed classes

If you have very skewed classes, it becomes much harder to use classification accuracy alone: you can get very high accuracy (very low error) without it being clear that the classifier has really improved, because predicting y = 0 all the time does not seem like a particularly good classifier. The snippet below illustrates the problem.
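A tiny illustration (the 0.5% positive rate is an assumed example in the spirit of the cancer-classification setting):

```python
import numpy as np

y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1                      # only 0.5% of patients have cancer (y = 1)
y_pred = np.zeros(1000, dtype=int)  # degenerate classifier: always predict 0

print(np.mean(y_pred == y_true))    # 0.995 -- 99.5% accurate, detects nothing
```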

Precision/Recall

Let $y = 1$ in the presence of the rare class that we want to detect.

| Predicted \ Actual | 1 | 0 |
| --- | --- | --- |
| 1 | True positive | False positive |
| 0 | False negative | True negative |

Precision (of all patients where we predicted $y = 1$, what fraction actually has cancer?):
True positives / #predicted positives = True positives / (True positives + False positives)

Recall (of all patients that actually have cancer, what fraction did we correctly detect as having cancer?). A high recall would be a good thing.
True positives / #actual positives = True positives / (True positives + False negatives)
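A minimal sketch computing both quantities from the counts in the table above:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the rare class y = 1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```

Note that the degenerate "always predict 0" classifier from the previous section gets recall 0, which exposes it even though its accuracy is high.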

Trading off precision and recall

How do we compare precision/recall numbers?
Compute the $F_1$ score: $F_1 = \frac{2PR}{P + R}$
If $P = 1$ and $R = 1$, then $F_1 = 1$ (perfect).
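A short sketch of the comparison (the numbers are example precision/recall pairs):

```python
def f1_score(p, r):
    """F1 = 2PR / (P + R); defined as 0 when both P and R are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(f1_score(0.5, 0.4))   # ~0.444: a balanced classifier
print(f1_score(0.02, 1.0))  # ~0.039: "predict y = 1 always" scores poorly
```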

Data for machine learning

Large data rationale
  • Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; a neural network with many hidden units).
  • Use a very large training set (unlikely to overfit).

Support Vector Machines(SVM)

Optimization objective

The SVM sometimes gives a cleaner, and sometimes a more powerful, way of learning complex nonlinear functions. Recall the logistic regression hypothesis:
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}}$

If $y = 1$, we want $h_\theta(x) \approx 1$, i.e. $\theta^T x \gg 0$.
If $y = 0$, we want $h_\theta(x) \approx 0$, i.e. $\theta^T x \ll 0$.

SVM hypothesis
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$

Hypothesis:
$h_\theta(x) = 1$ if $\theta^T x \geq 0$
$h_\theta(x) = 0$ otherwise
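A sketch of this objective in Python; the notes show $\mathrm{cost}_1$ and $\mathrm{cost}_0$ only as plots, so the piecewise-linear (hinge) forms below are an assumed standard choice:

```python
import numpy as np

def cost_1(z):
    """Cost when y = 1: zero once z >= 1, growing linearly below that."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Cost when y = 0: zero once z <= -1, growing linearly above that."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """C * sum of per-example costs + (1/2) * sum_j theta_j^2.
    Assumes no intercept term (theta_0 = 0), so all of theta is regularized."""
    z = X @ theta  # theta^T x^(i) for every training example
    data_term = C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta ** 2)
    return data_term + reg_term
```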

Large margin intuition

  • margin: the gap between the decision boundary and the nearest training examples

The mathematics behind large margin classification (optional)

  • Vector inner products
  • $\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2}$
  • $p$ = length of the projection of $v$ onto $u$ (signed)
  • $u^T v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$
  • $\min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$
    s.t.
  • $\theta^T x^{(i)} \geq 1$ if $y^{(i)} = 1$
  • $\theta^T x^{(i)} \leq -1$ if $y^{(i)} = 0$

Simplification: $\theta_0 = 0$, $n = 2$.
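A quick numeric check of the identity $u^T v = p \cdot \|u\|$ with arbitrary example vectors:

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

p = np.dot(u, v) / np.linalg.norm(u)  # signed length of v's projection onto u
print(np.dot(u, v))                   # u1*v1 + u2*v2 = 10.0
print(p * np.linalg.norm(u))          # p * ||u|| = 10.0, the same value
```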

If you look at this hypothesis, we want the projections $p^{(i)}$ of the positive and negative examples onto $\theta$ to be large, and the only way for that to hold true is if the decision boundary keeps a large margin around it.

This is why the support vector machine ends up being a large margin classifier: to satisfy the constraints while keeping $\|\theta\|$ small, it must make the projections $p^{(i)}$ large, and $p^{(i)}$ is the distance from a training example to the decision boundary.

If $\theta_0 = 0$, we are entertaining only decision boundaries that pass through the origin.

Kernels I

This particular choice of similarity function is called a Gaussian kernel.
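A minimal sketch of the Gaussian kernel, $\mathrm{similarity}(x, l) = \exp\left(-\frac{\|x - l\|^2}{2\sigma^2}\right)$, between an example $x$ and a landmark $l$:

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l = np.array([1.0, 2.0])
print(gaussian_kernel(x, l))        # 1.0: x sits exactly on the landmark
print(gaussian_kernel(x, l + 5.0))  # ~1e-11: far from the landmark
```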

Kernels II

$C\ (= \frac{1}{\lambda})$:
Large $C$: lower bias, higher variance.
Small $C$: higher bias, lower variance.

$\sigma^2$:
Large $\sigma^2$: features $f_i$ vary more smoothly (higher bias, lower variance).
Small $\sigma^2$: features $f_i$ vary less smoothly (lower bias, higher variance).
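For context, a hedged usage sketch with scikit-learn's `SVC` (a library choice assumed here, not something the notes use): its RBF kernel is exactly this Gaussian kernel with `gamma` $= \frac{1}{2\sigma^2}$, so large $\sigma^2$ corresponds to small `gamma`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# C and gamma trade off bias and variance as described above;
# gamma = 1 / (2 * sigma^2), so gamma = 0.5 corresponds to sigma = 1.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```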

Mercer’s Theorem

Note: not all similarity functions $\mathrm{similarity}(x, l)$ make valid kernels. (They need to satisfy a technical condition called Mercer’s Theorem to make sure SVM packages’ optimizations run correctly and do not diverge.)
