Reflections on Reading "Machine Learning Done Wrong"

I came across this article via the "I Love Machine Learning" (52ml) Weibo feed, at http://www.52ml.net/15845.html

http://ml.posthaven.com/machine-learning-done-wrong

Here I've tidied up the original article together with the replies on Weibo. I have made all seven of these mistakes myself, so this was a welcome reminder.

The quoted passages are Zhang Dong's (张栋) comments from Weibo.

Preface

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. 

In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. 

But as we hit "big data", it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem.

Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable. 

In industry, most practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data. 

In this post, I would like to share some common mistakes (the don't-s). 

I’ll save some of the best practices (the do-s) in a future post.

In the preface, the author highlights two phenomena:

1. Computation is now cheap, so people simply try every method on the data.

2. Practitioners tend to reach for the method they know best rather than the one best suited to the data at hand.

By this point, even a novice like me can tell the author disapproves of both. The article then walks through the seven common mistakes.

1. Take default loss function for granted


Many practitioners train and pick the best model using the default loss function (e.g., squared error). 

In practice, the off-the-shelf loss function rarely aligns with the business objective.

Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. 

The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally. 

To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount.

Also, data sets in fraud detection usually contain highly imbalanced labels.

In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).

"Machine learning is essentially solving an optimization problem; if the optimization objective is defined wrong (i.e., the loss function is wrong), everything else is wrong!"

Using fraudulent-transaction detection as the example, the author shows that the usual off-the-shelf loss functions (e.g., squared error) are a poor fit for the business objective.
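As a minimal sketch of this idea (assuming scikit-learn, synthetic data, and a hypothetical per-transaction dollar amount), one way to bias the default loss is to weight each fraud case by the dollars at stake:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                 # stand-in transaction features
y = (rng.random(n) < 0.05).astype(int)      # highly imbalanced labels (~5% fraud)
amount = rng.uniform(10, 500, size=n)       # hypothetical dollar amount at stake

# Weight each fraud case by its dollar amount; legitimate cases keep weight 1.
# This penalizes false negatives in proportion to the dollars lost and also
# biases the loss toward the rare class, as the article suggests.
sample_weight = np.where(y == 1, amount, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=sample_weight)
```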

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to logistic regression because it's simple. But many also forget that logistic regression is a linear model, and the non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models like SVM with a kernel or tree-based classifiers that bake in higher-order interaction features.
"Whenever possible, use feature processing and transformations so that a non-linear case can be solved with a linear model: linear models have simple training algorithms, can handle massive data, and so on."

The author reminds us to watch for higher-order feature interactions (e.g., categorical and other non-numeric discrete variables) and leans toward SVMs or decision trees for such cases.
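A minimal sketch of both routes on toy data, assuming scikit-learn: either hand-encode the interaction and keep the linear model, or let a kernel SVM pick it up implicitly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends purely on an interaction

# Option A: encode the interaction manually, then a linear model suffices.
linear = make_pipeline(PolynomialFeatures(degree=2, interaction_only=True),
                       LogisticRegression())
linear.fit(X, y)

# Option B: an RBF-kernel SVM bakes in higher-order interactions automatically.
svm = SVC(kernel="rbf").fit(X, y)
```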

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm robust to outliers or filter the outliers out.

"In many cases, if you don't filter out the outliers ahead of time, you need a model that can handle outliers (or build outlier handling into the training procedure)."

The author singles out AdaBoost; I wonder whether similar boosting methods, or random forests, are equally sensitive to outliers.

The author also notes that outliers can carry real value in themselves and should not simply be discarded.
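A minimal sketch of the two remedies on synthetic data, assuming scikit-learn (Huber loss here standing in for any outlier-robust choice):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)
y[:5] += 50                                  # inject a few gross outliers

# Remedy 1: filter points far from the bulk before fitting.
mask = np.abs(y - np.median(y)) < 3 * y.std()
filtered = LinearRegression().fit(X[mask], y[mask])

# Remedy 2: a robust loss (Huber) caps the influence each outlier can have.
robust = HuberRegressor().fit(X, y)
```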

4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (number of samples << number of features) -- common in fields like medical data -- the richer feature space implies a much higher risk of overfitting the data. In fact, high variance models should be avoided entirely when n<<p.
The dimensionality problem is an old story. That said, as far as I can tell, isn't liblinear designed precisely for the n<<p case?
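A rough sketch of the n<<p risk on synthetic data, assuming scikit-learn (with LinearSVC as the liblinear-backed, low-variance baseline); on data like this the plain linear SVM usually cross-validates better than the RBF-kernel one:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
n, p = 40, 2000                        # far fewer samples than features
X = rng.normal(size=(n, p))
y = (X[:, 0] > 0).astype(int)          # only the first feature carries signal

# The RBF kernel's implicit feature space is far richer than 40 samples can
# support, so it tends to fit noise; the linear model is the safer default here.
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
print(cross_val_score(LinearSVC(max_iter=10000), X, y, cv=5).mean())
```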

5. L1/L2/... regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than if the unit were cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount gets penalized more when the unit is dollars. Hence, the regularization is biased and tends to penalize features measured on smaller scales. To mitigate the problem, standardize all the features as a preprocessing step to put them on an equal footing.

"Feature standardization is an important preprocessing step: when multi-dimensional features are combined, it is essential that they be comparable on the same scale."

Well, this is the first time I've thought carefully about the interplay between regularization and standardization; it deserves attention.

The author's example of dollar-denominated and cent-denominated data living side by side is indeed scary.

See also: http://blog.csdn.net/zouxy09/article/details/24971995
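A minimal sketch of the dollars-versus-cents effect, assuming scikit-learn: the same feature expressed in different units gets a ~100x different coefficient, so an unstandardized penalty treats the two versions very differently, while standardizing first puts them on an equal footing:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cents = rng.uniform(1000, 50000, size=(300, 1))  # transaction amount in cents
dollars = cents / 100                            # same feature, new unit
y = 0.001 * cents[:, 0] + rng.normal(size=300)

# Without standardization, the dollar coefficient is ~100x the cent one,
# so an L1/L2 penalty would shrink the dollar version far more aggressively.
for X in (cents, dollars):
    print(Ridge(alpha=10.0).fit(X, y).coef_)

# After standardization, both versions land on the same coefficient.
for X in (cents, dollars):
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)
    print(pipe.named_steps["ridge"].coef_)
```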

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground truth model is Y = X1 + X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution recovers the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2, and Y = 100*X1 - 99*X2 are all equally good. The problem might not be detrimental, since it doesn't bias the estimation, but it does make the problem ill-conditioned and the coefficient weights uninterpretable.
"In the vast majority of cases, a true 'linear relationship' rarely exists (e.g., between ad click-through rate and the length of the red-highlighted text), but a big 'non-linear problem' can be transformed into N small 'linear problems.'"

Here the author shows the damage that collinear predictors do to a linear model.
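A minimal sketch of the Y = X1 + X2 example in plain NumPy: once X2 is nearly a copy of X1, the least-squares weights can swing to large offsetting values and the design matrix becomes ill-conditioned, even though predictions stay accurate:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=1e-6, size=500)       # X2 is almost a copy of X1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.01, size=500)   # ground truth: Y = X1 + X2

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)               # typically far from [1, 1]: large offsetting weights
print(np.linalg.cond(X))  # condition number explodes
```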

7. Interpreting absolute values of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature. This is rarely true, because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely they are to be multi-collinear and the less reliable it is to read feature importance off the coefficients.
"The feature weights trained by LR are strongly correlated with feature importance, but they do not fully represent it (many situations need special consideration)."

To be honest, when I run regressions I rarely look at the significance of each coefficient (or predictor); apparently that habit has its upside.
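A minimal sketch of point (a) above, assuming scikit-learn: rescaling a feature changes the absolute value of its coefficient without changing the model's predictions, so raw coefficient magnitude is not an importance score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

print(LinearRegression().fit(X, y).coef_)           # roughly [1.0, 0.5]

X_rescaled = X.copy()
X_rescaled[:, 1] /= 100                             # same feature, new unit
print(LinearRegression().fit(X_rescaled, y).coef_)  # roughly [1.0, 50.0]
```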

Summary

So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.

At the end, the author reminds us once more to choose the method that fits the data.

Addendum

I recently came across a "top 10 mistakes" list as well; apparently data mining is easy to get wrong.











