[Study Notes] Andrew Ng's Machine Learning | Chapter 9 | Machine Learning System Design


Created: July 15, 2023 4:01 PM

Brief Notes


  1. Course links
    1. Bilibili
    2. NetEase Cloud Classroom
    3. Lecture notes
  2. Since the course is taught in English, the content is recorded in English, with brief explanations added.
  3. These study notes exist purely to deepen my own understanding of the material; if anything here is wrong, I would be grateful for your tolerance and corrections.
  4. Many thanks to Professor Andrew Ng for his selfless dedication!!!

Key terms


Error analysis · Numerical evaluation · Skewed classes · Precision · Recall · F_1 score

Prioritizing what to work on


Building a spam classifier


  1. Spam (y=1), non-spam (y=0)
  2. Supervised learning
    1. x = features of email
    2. y = spam (1) or not spam (0)
  3. Features x: choose 100 words indicative of spam/not spam
  4. Note: in practice, take the most frequently occurring words (10,000 to 50,000) in the training set, rather than manually picking 100 words (see the sketch after this list)
  5. How to spend your time to make it have low error?
    1. Collect lots of data → more data often helps, though it is not guaranteed to make the algorithm more accurate
    2. Develop sophisticated features based on email routing information (e.g. sender and header data)
    3. Develop sophisticated features for the message body
    4. Develop sophisticated algorithms to detect deliberate misspellings
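As a rough illustration of note 4 above, here is a minimal Python sketch of building the vocabulary from the most frequent training-set words and encoding an email as a binary feature vector x. The four-email corpus and the tiny vocabulary size are made up for illustration; a real system would use tens of thousands of emails and words.

from collections import Counter

# Hypothetical toy training corpus of (email text, label) pairs;
# in practice this would be tens of thousands of emails.
train_emails = [
    ("buy cheap meds now", 1),
    ("meeting notes attached", 0),
    ("cheap deal buy now", 1),
    ("project deadline tomorrow", 0),
]

# Keep the n most frequently occurring words in the training set
# (the course suggests 10,000 to 50,000; we use 5 for the toy corpus).
word_counts = Counter(w for text, _ in train_emails for w in text.split())
vocab = [w for w, _ in word_counts.most_common(5)]

def email_to_features(text):
    """Binary feature vector: x_j = 1 if vocabulary word j appears."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)
print(email_to_features("buy now or miss this cheap deal"))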

Error analysis & Numerical evaluation


Recommended approach

  1. Start with a simple algorithm that you can implement quickly; implement it and test it on your cross-validation data
  2. Plot learning curves to decide whether more data or more features are likely to help, i.e. diagnose whether the algorithm suffers from high bias, high variance, or some other problem
  3. Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on, and see if you spot any systematic trend in the type of examples it gets wrong, then improve the algorithm accordingly (see the sketch after this list)
  4. Manually inspecting the algorithm's mistakes often points you toward the most productive changes
  5. Make sure you have a single numerical evaluation metric, so that every time you change the learning algorithm it returns a number estimating how well it performs
  6. Error analysis alone may not tell you whether a candidate change is likely to improve performance; often the only solution is to try it and see whether it works
  7. Do error analysis on the cross-validation set, not on the training set
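A minimal sketch of steps 3 and 7, under stated assumptions: predict below is a stand-in for a hypothetical quick-and-dirty classifier, and the cross-validation examples and error categories are made up for illustration.

from collections import Counter

def predict(text):
    # Hypothetical quick-and-dirty rule: flag any email containing "cheap".
    return 1 if "cheap" in text else 0

# Hypothetical cross-validation set of (email text, true label) pairs.
cv_set = [
    ("cheap meds here", 1),
    ("ch3ap m0rtgage rates", 1),   # deliberate misspellings
    ("team lunch on friday", 0),
    ("re: your bank acc0unt", 1),  # obfuscated phishing message
]

# Manually examine the cross-validation examples the algorithm got wrong...
errors = [(text, y) for text, y in cv_set if predict(text) != y]

# ...and tally hand-assigned categories to spot a systematic trend
# (here: the errors involve deliberate misspellings/obfuscation).
categories = Counter()
for text, _ in errors:
    key = "misspellings/obfuscation" if any(c.isdigit() for c in text) else "other"
    categories[key] += 1

print(len(errors), "errors:", dict(categories))

If most misclassified examples fall into one category, that category is the most promising place to spend engineering time.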

Error metrics for skewed classes


Skewed classes

  1. Skewed classes: one class has many more examples than the other
  2. An algorithm that always predicts y=0 (or always predicts y=1) can appear to perform very well
  3. Using classification error or classification accuracy as the evaluation metric invites exactly this problem: error or accuracy can "improve" merely because the code was replaced with a rule that always predicts y=0 or y=1
  4. For a skewed classification problem, classification accuracy is therefore not a good measure of the algorithm's performance (see the numeric sketch below)
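A small numeric sketch of the problem: with a hypothetical 0.5% positive rate, a degenerate "classifier" that always predicts y=0 reaches roughly 99.5% accuracy while detecting no positives at all.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed dataset: about 0.5% of labels are positive (y=1).
y_true = (rng.random(10_000) < 0.005).astype(int)

# Degenerate "algorithm" that ignores x and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(f"accuracy = {accuracy:.3f}")  # ~0.995, despite learning nothing
print("true positives found:", int(((y_pred == 1) & (y_true == 1)).sum()))  # 0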

Precision/Recall


Confusion matrix (Predicted class vs. Actual class):

  Predicted 1, Actual 1 → True positive
  Predicted 1, Actual 0 → False positive
  Predicted 0, Actual 1 → False negative
  Predicted 0, Actual 0 → True negative

  1. Precision: of all the examples we predicted positive, the fraction that are truly positive → higher is better
  2. Positive is abbreviated pos, negative is abbreviated neg
  3. Recall: of all the examples that are actually positive, the fraction we correctly detected → higher is better
  4. By convention, y=1 denotes the rarer class that we want to detect

Precision = \frac{True\ positive}{Predicted\ positive} = \frac{True\ positive}{True\ positive + False\ positive}

Recall = \frac{True\ positive}{Actual\ positive} = \frac{True\ positive}{True\ positive + False\ negative}
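The two formulas translate directly into code. A minimal sketch with made-up labels and predictions:

import numpy as np

# Hypothetical true labels and predictions on a cross-validation set.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives

precision = tp / (tp + fp)  # fraction of predicted positives that are real
recall = tp / (tp + fn)     # fraction of actual positives that were caught

print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.75, 0.60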

Trading off precision and recall


  1. Suppose we want to predict y=1 (cancer) only if we are very confident → predict 1 if h_θ(x) ≥ 0.7, predict 0 if h_θ(x) < 0.7 → higher precision, lower recall
  2. Suppose we want to avoid missing too many cases of cancer (avoid false negatives) → predict 1 if h_θ(x) ≥ 0.3, predict 0 if h_θ(x) < 0.3 → higher recall, lower precision
  3. More generally: predict 1 if h_θ(x) ≥ threshold → raising or lowering the threshold trades precision against recall (see the sketch below)
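A sketch of the trade-off with hypothetical hypothesis outputs h_θ(x): sweeping the threshold from 0.3 up to 0.7 raises precision and lowers recall.

import numpy as np

# Hypothetical h_theta(x) outputs and true labels.
h = np.array([0.9, 0.75, 0.6, 0.4, 0.35, 0.2, 0.8, 0.3])
y = np.array([1,   1,    0,   1,   0,    0,   1,   1])

def precision_recall(h, y, threshold):
    pred = (h >= threshold).astype(int)  # predict 1 if h >= threshold
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(h, y, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")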

F_1 Score (F score)

F_1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

  1. The F score is also called the F_1 score; it is usually written as the F_1 score, but people commonly just say "F score"
  2. The F score evaluates precision and recall together: the higher the F score, the better the balance between high precision and high recall (see the sketch below)
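A minimal sketch of the F score as a single-number metric; the precision/recall pairs below are made up to show how the harmonic mean punishes an unbalanced pair:

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; high only when both are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A degenerate "always predict y=1" classifier: recall 1.0 but tiny precision.
print(f1_score(0.02, 1.0))   # ~0.039 -- a simple average would give 0.51
# A reasonably balanced classifier scores much higher.
print(f1_score(0.50, 0.40))  # ~0.444
# An extreme with recall 0 collapses to 0, as it should.
print(f1_score(1.00, 0.00))  # 0.0

In practice one can evaluate the F score at a range of thresholds on the cross-validation set and pick the threshold with the highest value.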

Data for machine learning


Large data rationale

  1. Assume feature x has sufficient information to predict y accurately
  2. If the features do not carry enough information, even a huge amount of data cannot predict y accurately
  3. Use a learning algorithm with many parameters → J_train(θ) will be small
  4. Use a very large training set (unlikely to overfit) → J_train(θ) ≈ J_test(θ)
  5. A learning algorithm with many parameters + a very large training set → J_test(θ) will be small
  6. Many parameters give low bias, and a very large training set gives low variance, so together we get a low-bias, low-variance learning algorithm (see the sketch below)
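A sketch of the rationale under stated assumptions: a hypothetical low-bias learner (a degree-15 polynomial standing in for "many parameters") is trained on growing samples of synthetic data in which x genuinely carries enough information to predict y. As m grows, J_test(θ) shrinks toward J_train(θ).

import numpy as np

rng = np.random.default_rng(1)

def make_data(m):
    # Synthetic task where x is sufficient to predict y (up to small noise).
    x = rng.uniform(-1, 1, m)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=m)
    return x, y

x_test, y_test = make_data(2000)

# A low-bias model: many parameters relative to the task.
for m in (30, 300, 3000):
    x_train, y_train = make_data(m)
    coeffs = np.polyfit(x_train, y_train, deg=15)
    j_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    j_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"m={m:5d}: J_train={j_train:.4f}, J_test={j_test:.4f}")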

Quotes from Professor Andrew Ng


  • “By the way, in fact, if you even get to the stage where you brainstorm a list of different options to try, you’re probably already ahead of the curve. Sadly, what most people do is instead of trying to list out the options of things you might try, what far too many people do is wake up one morning and for some reason just, have a weird gut feeling that, Oh let’s have a huge honeypot project to go and collect tons more data. And for whatever strange reason just wake up one morning and randomly fixate on one thing and just work on that for six months. But I think we can do better.”
  • “When starting on a new machine learning problem, what I almost always recommend is to implement a quick and dirty implementation of your learning algorithm.”
  • “It’s not who has the best algorithm that wins. It’s who has the most data.” —Banko and Brill