# 机器学习实践中的 7 种常见错误

http://ml.posthaven.com/machine-learning-done-wrong

http://blog.jobbole.com/70684/

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it's not obvious how to navigate and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one which best suits the data. In this post, I would like to share some common mistakes (the don'ts). I'll save some of the best practices (the dos) for a future post.

# 1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, off-the-shelf loss functions rarely align with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize fraud losses. The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
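One common way to express this idea in practice is through per-sample weights. The sketch below uses synthetic data and a hypothetical `amount` column standing in for transaction dollar values; the point is only to show how weighting fraud cases by dollar amount shifts the model toward catching the expensive frauds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced "fraud" data; `amount` is a hypothetical
# dollar value attached to each transaction.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 2.0).astype(int)  # rare positives
amount = rng.exponential(scale=100.0, size=n)

# Weight each fraud case by its dollar amount, so a missed $5,000
# fraud costs the model 100x more than a missed $50 one; legitimate
# transactions keep unit weight.
w = np.where(y == 1, amount, 1.0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=w)
```

Compared with an unweighted fit on the same data, the weighted model trades some false positives for higher recall on the costly cases, which is closer to the actual business objective.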

# 2. Use plain linear models for non-linear interaction

# 4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer, higher-dimensional feature space. Since this power comes at almost no extra cost, most practitioners use a kernel by default when training an SVM model. However, when the data has far fewer samples than features (n<<p), as is common with medical data in industry, the richer feature space means a much higher risk of overfitting the data. In fact, high variance models should be avoided entirely when n<<p.
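The overfitting risk is easy to demonstrate on synthetic data. In the hypothetical setting below (60 samples, 500 features, labels that are pure noise), an RBF-kernel SVM with weak regularization memorizes the training set while generalizing no better than a coin flip:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical n << p setting: 60 samples, 500 features, and labels
# that are pure noise, so there is nothing real to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
y = rng.integers(0, 2, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# A high-variance model (RBF kernel, weak regularization) can fit the
# noisy training labels almost perfectly.
rbf = SVC(kernel="rbf", C=100.0).fit(X_tr, y_tr)
print(rbf.score(X_tr, y_tr))  # near 1.0 on the training set
print(rbf.score(X_te, y_te))  # near chance on held-out data
```

The same gap shows up, less dramatically, whenever real signal is thin relative to the dimensionality, which is why linear (or heavily regularized) models are the safer default when n<<p.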

# 7. Interpret the absolute values of linear or logistic regression coefficients as feature importance
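The core problem is that a raw coefficient depends on the feature's unit of measurement: rescaling a feature rescales its coefficient by the inverse factor. A minimal sketch with two synthetic, equally predictive features measured in different (hypothetical) units:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two features with identical predictive power, measured in different
# units. Raw coefficient magnitude reflects the unit, not importance.
rng = np.random.default_rng(0)
n = 2000
x_km = rng.normal(size=n)           # a distance, in kilometers
x_m = rng.normal(size=n) * 1000.0   # another distance, in meters
y = 2.0 * x_km + 0.002 * x_m + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(np.column_stack([x_km, x_m]), y)
print(model.coef_)  # roughly [2.0, 0.002]: a 1000x gap, equal importance

# Coefficients are only comparable after putting features on the same
# scale, e.g. coef * std(feature) is about 2.0 for both here.
```

Standardizing features before fitting (or multiplying each coefficient by its feature's standard deviation afterwards) is the usual way to make coefficients comparable, and even then correlated features can split or swap importance between them.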
