These are my study notes on Andrew Ng's machine learning course. Course videos:
https://study.163.com/course/introduction.htm?courseId=1004570029#/courseDetail?tab=1
49. Machine learning system design: prioritizing what to work on: spam classification example
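Take building a spam filter as the example. First, build a classifier:
Choose high-frequency words as features.
Ways to reduce the classifier's error rate, for example:
- Collect a large amount of data
- Use sophisticated features extracted from email routing information (such as the sender and subject), for example an empty subject line or @saler.com
- Use sophisticated features extracted from the email body, for example words such as "discount" and "promotion"
- Detect misspelled words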
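As a rough illustration of "choose high-frequency words as features", the sketch below builds a vocabulary from the most frequent words in a training set and turns an email into a binary feature vector. The helper names (`tokenize`, `build_vocabulary`, `to_feature_vector`) and the three sample emails are made up for this example; a real system would use a far larger vocabulary and corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, size):
    """Return the `size` most frequent words across all emails."""
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(email, vocabulary):
    """Binary vector: x[j] is 1 if vocabulary[j] appears in the email, else 0."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical toy training set, for illustration only.
emails = [
    "Buy now, big discount",
    "Meeting notes for now",
    "discount discount buy",
]
vocab = build_vocabulary(emails, 3)
x = to_feature_vector("buy one now", vocab)
print(vocab, x)
```

The resulting vector `x` is what would be fed to the classifier, one entry per vocabulary word.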
50. Machine learning system design: Error analysis
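Methodology:
Error analysis:
Look at how the misclassified examples are distributed across categories. The categories that account for a large share are where improving the algorithm pays off most. Try new approaches (more data, more features...) and then check what the main sources of error are.
The algorithm should ideally return a quantitative evaluation result, such as the error rate, so that the error rates obtained with different features or methods (for example, with or without stemming) can guide the decision:
If the error rate is lower with stemming, adopt the variant that uses stemming.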
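A minimal sketch of this workflow: compute a single error rate per variant on the cross-validation set, keep the variant with the lower rate, and tally the hand-assigned categories of the misclassified emails to see which one dominates. All labels, predictions, and category names here are hypothetical placeholders, not results from the course.

```python
from collections import Counter

def error_rate(predictions, labels):
    """Fraction of examples whose prediction differs from the true label."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

# Hypothetical cross-validation labels and the predictions of two variants.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions_without_stemming = [1, 0, 0, 0, 0, 1, 1, 0]
predictions_with_stemming = [1, 0, 1, 0, 0, 1, 1, 0]

rates = {
    "without stemming": error_rate(predictions_without_stemming, labels),
    "with stemming": error_rate(predictions_with_stemming, labels),
}
best_variant = min(rates, key=rates.get)  # variant with the lowest error rate

# Hypothetical categories assigned by hand to misclassified emails.
error_categories = ["pharma", "phishing", "pharma", "replica", "phishing", "phishing"]
category_counts = Counter(error_categories).most_common()

print(rates, best_variant)
print(category_counts)
```

The category at the top of `category_counts` is the first candidate for new features or more data.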
51. Machine learning system design: Error metric for skewed classes
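skewed classes: classes in which one label is far more common than the other
accuracy: the fraction of all examples that are classified correctly
Precision: of the examples predicted positive, the fraction that are actually positive
Recall: of the examples that are actually positive, the fraction that are predicted positive
The higher the precision and recall, the better.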
If a classifier is getting high precision and high recall, then we can be confident that the algorithm is doing well, even if the classes are very skewed.
So for the problem of skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate a learning algorithm than looking at classification error or classification accuracy.
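To see why accuracy misleads on skewed classes, the sketch below scores a "classifier" that always predicts the negative class on a made-up data set with 1% positives. The function names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(predictions, labels):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    return tp, fp, fn, tn

def evaluate(predictions, labels):
    """Return (accuracy, precision, recall) for binary 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(predictions, labels)
    accuracy = (tp + tn) / len(labels)
    # Assumed convention: report 0.0 when nothing was predicted positive
    # or when there are no actual positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical skewed data set: 1 positive example out of 100.
labels = [1] + [0] * 99
always_negative = [0] * 100

accuracy, precision, recall = evaluate(always_negative, labels)
print(accuracy, precision, recall)
```

The accuracy looks excellent while recall is zero, which is exactly the failure that precision and recall expose.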
51. Machine learning system design: Trading off precision and recall
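threshold: the cutoff on the predicted probability at or above which an example is labeled positive
Few examples are flagged, but anything flagged is almost certainly correct -> high precision, low recall. For example, with spam filtering you do not want to lose legitimate email.
Many examples are flagged, but many of them are false alarms -> low precision, high recall. For example, when predicting cancer, stay suspicious :)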
Use the F score (F1 = 2PR / (P + R), where P is precision and R is recall) to judge whether a given combination of precision and recall is acceptable.
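One way to use the F score in practice is to compute precision and recall at several thresholds on the cross-validation set and keep the threshold with the highest F1 = 2PR / (P + R). The sketch below does this on made-up probabilities and labels. The candidate thresholds and helper names are assumptions for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_at(threshold, probabilities, labels):
    """Predict 1 when probability >= threshold, then score the predictions."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical cross-validation outputs: predicted probability and true label.
probabilities = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0]
thresholds = [0.3, 0.5, 0.7, 0.9]

scores = {}
for t in thresholds:
    precision, recall = precision_recall_at(t, probabilities, labels)
    scores[t] = f1_score(precision, recall)
    print(f"threshold={t}: precision={precision:.2f} recall={recall:.2f} F1={scores[t]:.2f}")

best_threshold = max(scores, key=scores.get)
print("best threshold:", best_threshold)
```

Raising the threshold pushes precision up and recall down. F1 only rewards settings where neither of the two collapses.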
52. Machine learning system design: data for machine learning
Under the right conditions, a larger training set improves the algorithm's performance.
In that case, a large training set gives good results, and which algorithm is used matters much less.
Key tests:
First, can a human expert look at the features x and confidently predict the value of y?
Second, can we actually get a large training set and train a learning algorithm with many parameters on it?
If both hold, you can often get a very well-performing algorithm.