I came across this article via the "I Love Machine Learning" (52ml) Weibo account, at http://www.52ml.net/15845.html
http://ml.posthaven.com/machine-learning-done-wrong
Below I've combined the original article with the replies from Weibo. I've made all seven of these mistakes myself, so this serves as another reminder.
The quoted passages are Zhang Dong's comments from Weibo.
Preface
Statistical modeling is a lot like engineering.
In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern.
In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.
When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low.
But as we hit "big data", it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.
As pointed out in my previous post, there are dozens of ways to solve a given modeling problem.
Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable.
In industry, most practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data.
In this post, I would like to share some common mistakes (the don'ts).
I'll save some of the best practices (the do's) for a future post.
In the preface, the author points out two phenomena:
1. Computation has become so cheap that people simply try every method on the data and pick the best one.
2. Practitioners tend to use the method they are most familiar with rather than the one that best suits the data at hand.
By this point, even a casual reader like me can tell the author disapproves of both. The seven common mistakes are analyzed below.
1. Take default loss function for granted
Many practitioners train and pick the best model using the default loss function (e.g., squared error).
In practice, the off-the-shelf loss function rarely aligns with the business objective.
Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss.
The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally.
To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount.
Also, data sets in fraud detection usually contain highly imbalanced labels.
In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
"Machine learning is essentially solving an optimization problem; if the optimization objective (i.e., the loss function) is defined wrong, everything is wrong!"
Here the author uses fraudulent-transaction detection as an example of why the usual default loss functions (e.g., squared error) are a poor fit.
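To make this concrete, here is a minimal sketch of a business-aligned loss in Python. The function name `business_loss`, the toy labels, and the dollar amounts are all illustrative assumptions of mine, not from the original post: each missed fraud costs its transaction amount, while a false alarm costs a flat review fee.

```python
import numpy as np

def business_loss(y_true, y_pred, amounts, fp_cost=1.0):
    """Fraud-aligned loss: a false negative (missed fraud) costs its
    dollar amount; a false positive (false alarm) costs a flat fp_cost."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = (y_true == 1) & (y_pred == 0)   # fraud we failed to catch
    fp = (y_true == 0) & (y_pred == 1)   # legitimate transaction flagged
    return amounts[fn].sum() + fp_cost * fp.sum()

y_true   = np.array([1, 1, 0, 0, 0])
amounts  = np.array([10.0, 900.0, 50.0, 30.0, 70.0])
y_pred_a = np.array([1, 0, 0, 1, 0])   # misses the $900 fraud + 1 false alarm
y_pred_b = np.array([0, 1, 0, 1, 0])   # misses the $10 fraud + 1 false alarm

# Both classifiers make exactly two mistakes, so a default 0/1 loss ranks
# them equally -- but the business loss strongly prefers the second one.
print(business_loss(y_true, y_pred_a, amounts))  # 901.0
print(business_loss(y_true, y_pred_b, amounts))  # 11.0
```

In practice the same idea is usually wired into training through sample weights or class weights rather than only through the evaluation metric.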
2. Use plain linear models for non-linear interaction
3. Forget about outliers
Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.
Some models are more sensitive to outliers than others. For instance, AdaBoost might treat outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair number of outliers, it's important either to use a modeling algorithm that is robust to outliers or to filter the outliers out.
"In many cases, if outlier data are not filtered out up front, you need a model that can handle outliers (or you must add outlier-handling logic to the training procedure)."
The author specifically cites AdaBoost; I wonder whether similar boosting methods, or random forests, are equally sensitive to outliers.
The author also notes that outliers can themselves be valuable, rather than something to simply discard.
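As a rough illustration of the "robust model vs. filter the outliers" choice (a sketch of my own, with synthetic data assumed for the demo), we can compare ordinary least squares against scikit-learn's Huber loss on data containing one gross outlier:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(0.0, 0.1, size=20)  # true slope = 2
y[10] += 200.0  # one gross outlier, e.g. a measurement error

ols = LinearRegression().fit(X, y)    # squared loss: the outlier dominates
huber = HuberRegressor().fit(X, y)    # Huber loss: the outlier is downweighted

print(ols.coef_[0], huber.coef_[0])   # the Huber slope stays much closer to 2
```

Filtering the outlier before fitting OLS would work just as well here; the point is that doing neither quietly biases the fit.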
4. Use high variance model when n&lt;&lt;p
5. L1/L2/... regularization without standardization
Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.
Returning to fraud detection, imagine a linear regression model with a transaction-amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. The regularization is therefore biased and tends to penalize features on smaller scales. To mitigate the problem, standardize all features as a preprocessing step so they are on an equal footing.
"Feature standardization is an important preprocessing step: when multiple features are combined, it matters that they are comparable on a common scale."
OK, so this is the first time I've thought carefully about the interaction between regularization and standardization; it deserves attention.
The author's example of dollar-denominated data sitting next to cent-denominated data is indeed alarming.
http://blog.csdn.net/zouxy09/article/details/24971995
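The dollars-vs-cents example can be sketched as follows (the synthetic data and the choice of `alpha` are my own assumptions): the same Lasso fit gives different answers depending on the unit, unless the features are standardized first.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
amount_dollars = rng.uniform(1.0, 100.0, n)  # transaction amount, in dollars
amount_cents = 100.0 * amount_dollars        # the same feature, in cents
other = rng.normal(0.0, 1.0, n)              # some unit-scale feature
y = 0.5 * amount_dollars + other + rng.normal(0.0, 0.1, n)

X_dollars = np.column_stack([amount_dollars, other])
X_cents = np.column_stack([amount_cents, other])

# Unstandardized: the L1 penalty hits the two unit choices differently.
lasso_d = Lasso(alpha=0.1, max_iter=10000).fit(X_dollars, y)
lasso_c = Lasso(alpha=0.1, max_iter=10000).fit(X_cents, y)
print(lasso_d.coef_, lasso_c.coef_)  # different coefficients

# Standardized: both unit choices become the same optimization problem.
lasso_sd = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X_dollars), y)
lasso_sc = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X_cents), y)
print(lasso_sd.coef_, lasso_sc.coef_)  # identical coefficients
```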
6. Use linear model without considering multi-collinear predictors
7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance
Summary
So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.
The author closes by reminding us one more time to choose a method appropriate to the data.
Addendum