1. Prioritizing what to work on: Spam classification example
- collect lots of data
- develop sophisticated features based on email routing information
- develop sophisticated features for the message body
- develop a sophisticated algorithm to detect misspellings
2. Error analysis
- recommended approach
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples it gets wrong.
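The error analysis step can be sketched as a simple tally. This is a minimal illustration, assuming each cross-validation example has already been given a category by hand; the `cv_results` rows, the category names, and the `tally_errors` helper are all hypothetical and not part of the original notes.

```python
from collections import Counter

# Hypothetical cross-validation results, made up for illustration:
# (true label, predicted label, hand-assigned category), with 1 = spam, 0 = not spam.
cv_results = [
    (1, 0, "pharma"),
    (1, 1, "pharma"),
    (1, 0, "phishing"),
    (1, 0, "phishing"),
    (0, 1, "newsletter"),
    (1, 0, "phishing"),
    (0, 0, "personal"),
]

def tally_errors(results):
    """Count misclassified examples per hand-assigned category."""
    return Counter(cat for y_true, y_pred, cat in results if y_true != y_pred)

error_counts = tally_errors(cv_results)

# List the most frequent error categories first; these are the
# candidates for new features or more targeted data collection.
for category, count in error_counts.most_common():
    print(f"{category}: {count}")
```

A category that dominates the tally suggests where extra effort is likely to pay off; a category with few errors is probably not worth a sophisticated feature.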
3. Error metrics for skewed classes
- Precision/Recall
4. Trading off precision and recall
- how to compare precision/recall numbers
- F_1 score (F score)
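A small sketch of why the F_1 score, F_1 = 2PR / (P + R), is a more useful single number than the plain average of precision P and recall R. The three (precision, recall) pairs and the names in `candidates` are illustrative values, not results from a real model, and returning 0 when P + R = 0 is a convention assumed here.

```python
def f1_score(precision, recall):
    """F_1 = 2PR / (P + R); returns 0.0 when both inputs are 0 (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs for three hypothetical classifiers.
candidates = {
    "algorithm_1": (0.5, 0.4),
    "algorithm_2": (0.7, 0.1),
    "algorithm_3": (0.02, 1.0),  # e.g. a classifier that predicts y=1 almost always
}

for name, (p, r) in candidates.items():
    print(f"{name}: average={(p + r) / 2:.3f}  F1={f1_score(p, r):.3f}")

# Pick the classifier with the highest F_1 score.
best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print("best by F1:", best)
```

The plain average ranks `algorithm_3` highest even though its precision is close to zero, while F_1 is only high when precision and recall are both reasonably high.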
5. Data for machine learning
- It's not who has the best algorithm that wins. It's who has the most data.
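What are skewed classes? In a classification problem with only two outcomes, y=0 and y=1, where one class has very many examples and the other has very few, we call the classes skewed classes.
Consider a binary classification problem, in which each instance is labeled either positive or negative. Four outcomes are possible. If an instance is positive and is also predicted positive, it is a true positive; if an instance is negative but is predicted positive, it is a false positive. Correspondingly, a negative instance predicted negative is a true negative, and a positive instance predicted negative is a false negative.
- TP: the number of correct positive predictions;
- FN: misses, the number of matches that were not found;
- FP: false alarms, the number of reported matches that are incorrect;
- TN: the number of non-matches that were correctly rejected;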
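To make the four counts concrete, here is a minimal sketch that computes TP, FP, FN, and TN and derives precision = TP / (TP + FP) and recall = TP / (TP + FN) from them. The data set is made up (5 positives out of 100), the function names are hypothetical, and treating precision as 0 when nothing is predicted positive is a convention assumed here, since the ratio is strictly undefined in that case.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision and recall; each falls back to 0.0 if its denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up skewed data set: 5 positive examples out of 100.
y_true = [1] * 5 + [0] * 95
# A trivial "classifier" that always predicts the majority class y=0.
y_always_zero = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_always_zero)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On skewed classes the always-zero predictor reaches 95% accuracy while finding none of the positive examples, which is why precision and recall are the more informative metrics here.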
From: http://blog.csdn.net/abcjennifer/article/details/7834256