Machine learning system design
Prioritizing what to work on: spam classification example
How should you spend your time to make the classifier have low error?
- Collect lots of data
- Develop sophisticated features based on email routing information (from the email header).
- Develop sophisticated features for the message body, e.g. should "discount" and "discounts" be treated as the same word?
- Develop sophisticated algorithms to detect misspellings (e.g. "m0rtgage").
Error analysis
**Recommended approach**
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide whether more data, more features, etc. are likely to help.
- Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
- For example: manually examine 100 misclassified emails, and categorize them based on their type and on what features you think would have helped the algorithm classify them correctly.
The importance of numerical evaluation
- Should "discount"/"discounts"/"discounted"/"discounting" be treated as the same word?
- Error analysis may not be helpful for deciding whether this is likely to improve performance. The only solution is to try it and see whether it works, using a single numerical evaluation (e.g. cross-validation error).
Error metrics for skewed classes
Skewed classes
If you have very skewed classes, it becomes much harder to use classification accuracy alone: you can get very high accuracy (or very low error) without it being clear that you have really improved the quality of the classifier, because predicting y = 0 all the time doesn't seem like a particularly good classifier.
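A quick numeric sketch (with a made-up dataset) of why raw accuracy is misleading here:

```python
# Sketch: why accuracy misleads on skewed classes.
# Hypothetical dataset: 1000 examples, only 5 positives (0.5% positive rate).
y_true = [1] * 5 + [0] * 995

# A "classifier" that ignores its input and always predicts y = 0.
predictions = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.995 -- 99.5% accuracy, yet it detects no positives at all
```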
Precision/Recall
$y=1$ in presence of the rare class that we want to detect.

| | Actual class 1 | Actual class 0 |
| --- | --- | --- |
| Predicted 1 | True positive | False positive |
| Predicted 0 | False negative | True negative |
Precision (of all patients where we predicted $y=1$, what fraction actually has cancer?):

$$\text{Precision} = \frac{\text{true positives}}{\text{predicted positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
Recall (of all patients that actually have cancer, what fraction did we correctly detect as having cancer?). A high recall would be a good thing.

$$\text{Recall} = \frac{\text{true positives}}{\text{actual positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
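The two definitions above can be sketched directly in code; the labels below are made up for illustration:

```python
# Sketch: computing precision and recall from raw predictions.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many we caught
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]  # made-up predictions
p, r = precision_recall(y_true, y_pred)
print(p, r)  # 0.75 0.75
```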
Trading off precision and recall
How to compare precision/recall numbers?
To compare, calculate the $F_1$ score:

$$F_1 = \frac{2PR}{P+R}$$

$F_1 = 0$ if $P = 0$ or $R = 0$; $F_1 = 1$ only when $P = 1$ and $R = 1$ (perfect precision and recall).
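A small sketch of the $F_1$ formula, showing why it punishes an imbalanced precision/recall pair (the input values are illustrative):

```python
# Sketch: F1 is the harmonic mean of precision and recall; it is high
# only when BOTH are high, unlike a simple average.
def f1_score(p, r):
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

print(f1_score(0.5, 0.4))   # ~0.444: balanced values -> moderate score
print(f1_score(0.02, 1.0))  # ~0.039: "predict 1 always" gets recall 1 but scores near 0
print(f1_score(1.0, 1.0))   # 1.0: perfect precision AND recall
```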
Data for machine learning
Large data rationale
- Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; a neural network with many hidden units).
- Use a very large training set (unlikely to overfit).
Support Vector Machines (SVM)
Optimization objective
The SVM sometimes gives a cleaner, and sometimes a more powerful, way of learning complex nonlinear functions.
Starting from the logistic regression hypothesis:

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$$
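A minimal sketch of this hypothesis, with illustrative (made-up) values for $\theta$ and $x$:

```python
import math

# Sketch of the logistic hypothesis h_theta(x) = 1 / (1 + e^{-theta^T x}).
def h(theta, x):
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))  # theta^T x
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.5, -1.0]            # illustrative parameters
print(h(theta, [10.0, 0.0]))   # theta^T x >> 0 -> output close to 1
print(h(theta, [0.0, 10.0]))   # theta^T x << 0 -> output close to 0
```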
If $y=1$, we want $h_\theta(x)\approx 1$, i.e. $\theta^T x \gg 0$.

If $y=0$, we want $h_\theta(x)\approx 0$, i.e. $\theta^T x \ll 0$.
SVM hypothesis
$$\min_\theta\; C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$
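The objective can be sketched in code. The $\mathrm{cost}_1$/$\mathrm{cost}_0$ terms below are hinge-style piecewise-linear surrogates; the unit slope is an assumption (the notes do not pin it down), and for simplicity the sketch regularizes every $\theta_j$ rather than excluding $\theta_0$:

```python
# Hinge-style surrogate costs (assumed unit slope):
# cost_1(z) = 0 for z >= 1, growing linearly as z falls below 1;
# cost_0(z) = 0 for z <= -1, growing linearly as z rises above -1.
def cost_1(z):  # used when y = 1
    return max(0.0, 1.0 - z)

def cost_0(z):  # used when y = 0
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum of per-example costs + (1/2) * sum_j theta_j^2
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t_j * x_j for t_j, x_j in zip(theta, x_i))  # theta^T x^(i)
        data_term += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    reg_term = 0.5 * sum(t_j * t_j for t_j in theta)
    return C * data_term + reg_term

# Both made-up examples satisfy their margin, so only regularization remains.
theta = [1.0, 1.0]
X = [[2.0, 0.0], [-2.0, 0.0]]
y = [1, 0]
print(svm_objective(theta, X, y, C=1.0))  # 1.0
```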
Hypothesis: $h_\theta(x)=1$ if $\theta^T x \geq 0$, and $h_\theta(x)=0$ otherwise.
Large margin intuition
- margin: the distance between the decision boundary and the closest training examples
The mathematics behind large margin classification (optional)
- Vector inner products
- $\|u\|$ = length of vector $u$
- $p$ = length of the projection of $v$ onto $u$ (signed)
- $u^T v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$
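The identity $u^T v = p \cdot \|u\|$ can be checked numerically (the vectors here are made up):

```python
import math

# Sketch: u^T v equals p * ||u||, where p is the signed length of the
# projection of v onto u (2-D example).
u = [3.0, 4.0]
v = [2.0, 1.0]

inner = u[0] * v[0] + u[1] * v[1]          # u^T v = u1*v1 + u2*v2
norm_u = math.sqrt(u[0] ** 2 + u[1] ** 2)  # ||u|| = length of vector u
p = inner / norm_u                         # signed projection of v onto u

print(inner)       # 10.0
print(p * norm_u)  # 10.0 -- same value, as the identity states
```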
$$\min_\theta\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\sqrt{\theta_1^2+\theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$$

s.t.
- $\theta^T x^{(i)} \geq 1$ if $y^{(i)}=1$
- $\theta^T x^{(i)} \leq -1$ if $y^{(i)}=0$
Simplification: $\theta_0=0$, $n=2$
Looking at this optimization, we want the projections $p^{(i)}$ of the positive and negative examples onto $\theta$ to be large, and the only way for that to hold is for the decision boundary to keep a large margin around itself. This is why the support vector machine ends up being a large margin classifier: minimizing $\|\theta\|$ while satisfying the constraints forces the projections $p^{(i)}$, the distances from the training examples to the decision boundary, to be large.
If $\theta_0 = 0$, we are entertaining only decision boundaries that pass through the origin.
Kernels I
This is called a Gaussian kernel: a particular choice of similarity function.
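A sketch of the Gaussian kernel as a similarity function between a point $x$ and a landmark $l$, i.e. $f = \exp\!\left(-\frac{\|x-l\|^2}{2\sigma^2}\right)$ (the values below are illustrative):

```python
import math

# Sketch: Gaussian (RBF) kernel similarity between x and a landmark l.
def gaussian_kernel(x, l, sigma):
    sq_dist = sum((x_j - l_j) ** 2 for x_j, l_j in zip(x, l))  # ||x - l||^2
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

landmark = [3.0, 5.0]  # made-up landmark
print(gaussian_kernel([3.0, 5.0], landmark, sigma=1.0))  # 1.0: x at the landmark
print(gaussian_kernel([9.0, 9.0], landmark, sigma=1.0))  # ~0: x far from the landmark
```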
Kernels II
$C$ ($= \frac{1}{\lambda}$):
- Large $C$: lower bias, higher variance
- Small $C$: higher bias, lower variance

$\sigma^2$:
- Large $\sigma^2$: features $f_i$ vary more smoothly (higher bias, lower variance)
- Small $\sigma^2$: features $f_i$ vary less smoothly (lower bias, higher variance)
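The effect of $\sigma^2$ on how quickly a kernel feature $f_i$ falls off with distance can be seen numerically (the distance value is made up):

```python
import math

# Sketch: larger sigma^2 makes the Gaussian-kernel feature f decay more
# slowly with squared distance from a landmark, i.e. vary more smoothly.
def f(sq_dist, sigma_sq):
    return math.exp(-sq_dist / (2.0 * sigma_sq))

sq_dist = 4.0  # illustrative squared distance ||x - l||^2
print(f(sq_dist, sigma_sq=0.5))   # ~0.018: small sigma^2, similarity drops fast
print(f(sq_dist, sigma_sq=10.0))  # ~0.82:  large sigma^2, similarity decays slowly
```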
Mercer's Theorem
Note: not all similarity functions $\mathrm{similarity}(x, l)$ make valid kernels. A kernel needs to satisfy a technical condition called Mercer's Theorem to make sure SVM packages' optimizations run correctly and do not diverge.