[Study Notes] Andrew Ng's Machine Learning | Chapter 9 | Machine Learning System Design


Created: July 15, 2023 4:01 PM

Brief Notes


  1. Course links
    1. Bilibili
    2. NetEase Cloud Classroom
    3. Lecture notes
  2. Since the course is taught in English, the content is recorded in English, with brief explanations added.
  3. These study notes exist purely to deepen my own understanding of the material; if anything here is wrong, I would be grateful for your tolerance and corrections.
  4. Many thanks to Professor Andrew Ng for his selfless dedication!!!

Key terms


Error analysis · Numerical evaluation · Skewed classes · Precision · Recall · F_1 score

Prioritizing what to work on


Building a spam classifier


  1. Spam (y=1), non-spam (y=0)
  2. Supervised learning
    1. x = features of email
    2. y = spam (1) or not spam (0)
  3. Features x: choose 100 words indicative of spam/not spam
  4. Note: in practice, take the most frequently occurring words (10,000 to 50,000) in the training set, rather than manually picking 100 words (see the sketch after this list)
  5. How to spend your time to make it have low error?
    1. Collect lots of data → more data often helps, though it is not guaranteed to make the algorithm more accurate
    2. Develop sophisticated features based on email routing information (e.g. sender and header data)
    3. Develop sophisticated features for the message body
    4. Develop sophisticated algorithms to detect deliberate misspellings
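As a rough illustration of note 4 above, here is a minimal Python sketch of building the vocabulary from the most frequent training-set words and encoding an email as a binary feature vector x. The four-email corpus and the tiny vocabulary size are made up for illustration; a real system would use tens of thousands of emails and words.

from collections import Counter

# Hypothetical toy training corpus of (email text, label) pairs;
# in practice this would be tens of thousands of emails.
train_emails = [
    ("buy cheap meds now", 1),
    ("meeting notes attached", 0),
    ("cheap deal buy now", 1),
    ("project deadline tomorrow", 0),
]

# Keep the n most frequently occurring words in the training set
# (the course suggests 10,000 to 50,000; we use 5 for the toy corpus).
word_counts = Counter(w for text, _ in train_emails for w in text.split())
vocab = [w for w, _ in word_counts.most_common(5)]

def email_to_features(text):
    """Binary feature vector: x_j = 1 if vocabulary word j appears."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)
print(email_to_features("buy now or miss this cheap deal"))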

Error analysis & Numerical evaluation


Recommended approach

  1. Start with a simple algorithm that you can implement quickly; implement it and test it on your cross-validation data
  2. Plot learning curves to decide whether more data or more features are likely to help, i.e. diagnose whether the algorithm suffers from high bias, high variance, or some other problem
  3. Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on, and see if you spot any systematic trend in the type of examples it gets wrong, then improve the algorithm accordingly (see the sketch after this list)
  4. Manually inspecting the algorithm's mistakes often points you toward the most productive changes
  5. Make sure you have a single numerical evaluation metric, so that every time you change the learning algorithm it returns a number estimating how well it performs
  6. Error analysis alone may not tell you whether a candidate change is likely to improve performance; often the only solution is to try it and see whether it works
  7. Do error analysis on the cross-validation set, not on the training set
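A minimal sketch of steps 3 and 7, under stated assumptions: predict below is a stand-in for a hypothetical quick-and-dirty classifier, and the cross-validation examples and error categories are made up for illustration.

from collections import Counter

def predict(text):
    # Hypothetical quick-and-dirty rule: flag any email containing "cheap".
    return 1 if "cheap" in text else 0

# Hypothetical cross-validation set of (email text, true label) pairs.
cv_set = [
    ("cheap meds here", 1),
    ("ch3ap m0rtgage rates", 1),   # deliberate misspellings
    ("team lunch on friday", 0),
    ("re: your bank acc0unt", 1),  # obfuscated phishing message
]

# Manually examine the cross-validation examples the algorithm got wrong...
errors = [(text, y) for text, y in cv_set if predict(text) != y]

# ...and tally hand-assigned categories to spot a systematic trend
# (here: the errors involve deliberate misspellings/obfuscation).
categories = Counter()
for text, _ in errors:
    key = "misspellings/obfuscation" if any(c.isdigit() for c in text) else "other"
    categories[key] += 1

print(len(errors), "errors:", dict(categories))

If most misclassified examples fall into one category, that category is the most promising place to spend engineering time.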

Error metrics for skewed classes


Skewed classes

  1. Skewed classes: one class has many more examples than the other
  2. An algorithm that always predicts y=0 (or always predicts y=1) can appear to perform very well
  3. Using classification error or classification accuracy as the evaluation metric invites exactly this problem: error or accuracy can "improve" merely because the code was replaced with a rule that always predicts y=0 or y=1
  4. For a skewed classification problem, classification accuracy is therefore not a good measure of the algorithm's performance (see the numeric sketch below)
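A small numeric sketch of the problem: with a hypothetical 0.5% positive rate, a degenerate "classifier" that always predicts y=0 reaches roughly 99.5% accuracy while detecting no positives at all.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed dataset: about 0.5% of labels are positive (y=1).
y_true = (rng.random(10_000) < 0.005).astype(int)

# Degenerate "algorithm" that ignores x and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(f"accuracy = {accuracy:.3f}")  # ~0.995, despite learning nothing
print("true positives found:", int(((y_pred == 1) & (y_true == 1)).sum()))  # 0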

Precision/Recall


Confusion matrix (Predicted class vs. Actual class):

  Predicted 1, Actual 1 → True positive
  Predicted 1, Actual 0 → False positive
  Predicted 0, Actual 1 → False negative
  Predicted 0, Actual 0 → True negative

  1. Precision: of all the examples we predicted positive, the fraction that are truly positive → higher is better
  2. Positive is abbreviated pos, negative is abbreviated neg
  3. Recall: of all the examples that are actually positive, the fraction we correctly detected → higher is better
  4. By convention, y=1 denotes the rarer class that we want to detect

Precision = \frac{True\ positive}{Predicted\ positive} = \frac{True\ positive}{True\ positive + False\ positive}

Recall = \frac{True\ positive}{Actual\ positive} = \frac{True\ positive}{True\ positive + False\ negative}
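The two formulas translate directly into code. A minimal sketch with made-up labels and predictions:

import numpy as np

# Hypothetical true labels and predictions on a cross-validation set.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives

precision = tp / (tp + fp)  # fraction of predicted positives that are real
recall = tp / (tp + fn)     # fraction of actual positives that were caught

print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.75, 0.60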

Trading off precision and recall


  1. Suppose we want to predict y=1 (cancer) only if we are very confident → predict 1 if h_θ(x) ≥ 0.7, predict 0 if h_θ(x) < 0.7 → higher precision, lower recall
  2. Suppose we want to avoid missing too many cases of cancer (avoid false negatives) → predict 1 if h_θ(x) ≥ 0.3, predict 0 if h_θ(x) < 0.3 → higher recall, lower precision
  3. More generally: predict 1 if h_θ(x) ≥ threshold → raising or lowering the threshold trades precision against recall (see the sketch below)
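A sketch of the trade-off with hypothetical hypothesis outputs h_θ(x): sweeping the threshold from 0.3 up to 0.7 raises precision and lowers recall.

import numpy as np

# Hypothetical h_theta(x) outputs and true labels.
h = np.array([0.9, 0.75, 0.6, 0.4, 0.35, 0.2, 0.8, 0.3])
y = np.array([1,   1,    0,   1,   0,    0,   1,   1])

def precision_recall(h, y, threshold):
    pred = (h >= threshold).astype(int)  # predict 1 if h >= threshold
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(h, y, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")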

F_1 Score (F score)

F_1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

  1. The F score is also called the F_1 score; it is usually written as the F_1 score, but people commonly just say "F score"
  2. The F score evaluates precision and recall together: the higher the F score, the better the balance between high precision and high recall (see the sketch below)
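A minimal sketch of the F score as a single-number metric; the precision/recall pairs below are made up to show how the harmonic mean punishes an unbalanced pair:

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; high only when both are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A degenerate "always predict y=1" classifier: recall 1.0 but tiny precision.
print(f1_score(0.02, 1.0))   # ~0.039 -- a simple average would give 0.51
# A reasonably balanced classifier scores much higher.
print(f1_score(0.50, 0.40))  # ~0.444
# An extreme with recall 0 collapses to 0, as it should.
print(f1_score(1.00, 0.00))  # 0.0

In practice one can evaluate the F score at a range of thresholds on the cross-validation set and pick the threshold with the highest value.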

Data for machine learning


Large data rationale

  1. Assume feature x has sufficient information to predict y accurately
  2. If the features do not carry enough information, even a huge amount of data cannot predict y accurately
  3. Use a learning algorithm with many parameters → J_train(θ) will be small
  4. Use a very large training set (unlikely to overfit) → J_train(θ) ≈ J_test(θ)
  5. A learning algorithm with many parameters + a very large training set → J_test(θ) will be small
  6. Many parameters give low bias, and a very large training set gives low variance, so together we get a low-bias, low-variance learning algorithm (see the sketch below)
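A sketch of the rationale under stated assumptions: a hypothetical low-bias learner (a degree-15 polynomial standing in for "many parameters") is trained on growing samples of synthetic data in which x genuinely carries enough information to predict y. As m grows, J_test(θ) shrinks toward J_train(θ).

import numpy as np

rng = np.random.default_rng(1)

def make_data(m):
    # Synthetic task where x is sufficient to predict y (up to small noise).
    x = rng.uniform(-1, 1, m)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=m)
    return x, y

x_test, y_test = make_data(2000)

# A low-bias model: many parameters relative to the task.
for m in (30, 300, 3000):
    x_train, y_train = make_data(m)
    coeffs = np.polyfit(x_train, y_train, deg=15)
    j_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    j_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"m={m:5d}: J_train={j_train:.4f}, J_test={j_test:.4f}")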

Quotes from Professor Andrew Ng


  • “By the way, in fact, if you even get to the stage where you brainstorm a list of different options to try, you’re probably already ahead of the curve. Sadly, what most people do is instead of trying to list out the options of things you might try, what far too many people do is wake up one morning and for some reason just, have a weird gut feeling that, Oh let’s have a huge honeypot project to go and collect tons more data. And for whatever strange reason just wake up one morning and randomly fixate on one thing and just work on that for six months. But I think we can do better.”
  • “When starting on a new machine learning problem, what I almost always recommend is to implement a quick and dirty implementation of your learning algorithm.”
  • “It’s not who has the best algorithm that wins. It’s who has the most data.” —Banko and Brill