Over the weekend I took some time to read Google's 43 Rules of Machine Learning and got a lot out of it. Below are my notes on each rule, based on my own understanding; I hope they are useful to others.
Before Machine Learning
- Don’t be afraid to launch a product without machine learning.
- First, design and implement metrics.
- Choose machine learning over a complex heuristic.
  If a system's rules grow too complex, move straight to a model; a model is easier to iterate on and maintain.
ML Phase I: Your First Pipeline
- Keep the first model simple and get the infrastructure right.
  The baseline model must be simple and analyzable; that is what makes iteration possible.
- Test the infrastructure independently from the machine learning.
  Everything must be testable: sample processing, model validation, and the correctness of the results the model emits.
- Turn heuristics into features, or handle them externally.
  When moving from rules to a model, a rule that works particularly well can be fed to the model directly as a feature.
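A minimal sketch of this idea (the heuristic, its weights, and all field names are hypothetical): instead of using a hand-written rule as a hard pre-filter, expose its score as one more input feature that the model can weight.

```python
def spam_heuristic_score(num_links: int, caps_ratio: float) -> float:
    """A hand-written heuristic (hypothetical): more links and more
    ALL-CAPS text suggest spam. Returns a score in [0, 1]."""
    score = 0.3 * min(num_links / 10.0, 1.0) + 0.7 * min(caps_ratio, 1.0)
    return min(score, 1.0)

def make_features(doc: dict) -> dict:
    """Build the model's feature dict. The heuristic survives as an
    ordinary feature rather than as an external filtering step."""
    return {
        "length": len(doc["text"]),
        "spam_heuristic": spam_heuristic_score(doc["num_links"],
                                               doc["caps_ratio"]),
    }

feats = make_features({"text": "BUY NOW!!!", "num_links": 12, "caps_ratio": 0.8})
```

The model is then free to learn how much trust the old rule deserves, instead of the rule overriding the model.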
- Watch for silent failures.
  Pay close attention to data freshness; the coverage of every single feature must be tracked with a measurable metric.
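As an illustration of per-feature coverage tracking (a minimal sketch; the sample schema is made up), the metric is just the fraction of samples in which each feature is present. A sudden drop in one feature's coverage is the classic silent failure.

```python
from collections import Counter

def feature_coverage(samples: list[dict]) -> dict[str, float]:
    """Fraction of samples in which each feature is present (non-None)."""
    counts = Counter()
    for sample in samples:
        for name, value in sample.items():
            if value is not None:
                counts[name] += 1
    n = len(samples)
    return {name: c / n for name, c in counts.items()}

samples = [
    {"age": 31, "country": "US"},
    {"age": None, "country": "DE"},
    {"age": 25, "country": None},
    {"age": 40, "country": "FR"},
]
coverage = feature_coverage(samples)
```

In practice such numbers would be emitted to a dashboard with alerting, so a broken upstream join is noticed before it silently degrades the model.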
- Give feature columns owners and documentation.
  Every feature must be traceable to someone, and documented where necessary.
- Choose a simple, observable and attributable metric for your first objective.
  The training objective must be simple and easy to iterate on; prefer a binary classification metric over a regression metric.
- Starting with an interpretable model makes debugging easier.
  A more interpretable model is easier to debug.
ML Phase II: Feature Engineering
- Plan to launch and iterate.
  The model and pipeline must support iteration, making it easy to add and remove features.
- Start with directly observed and reported features as opposed to learned features.
  Early features should not be too complex. Avoid features produced by other models, such as embeddings from a DNN: they are costly to maintain, and the other model's objective may differ from yours, so what it learns may not suit your model.
- Explore with features of content that generalize across contexts.
  Content features that generalize across contexts matter a great deal.
- Use very specific features when you can.
  The more specific and simple a feature is, the better; overly abstract features are not as useful.
- Combine and modify existing features to create new features in human-understandable ways.
  New features must be created in ways humans can understand.
- The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
  The larger the dataset, the more features you can afford to use.
- Clean up features you are no longer using.
  Delete unused features outright, along with any crosses built on them.
  If a feature has low coverage and its presence is spread evenly across samples, delete it; if a feature has low coverage but the samples that do carry it are distributed very unevenly, it is a good feature.
- You are not a typical end user.
  Always find a representative user to evaluate the system's experience.
- Measure the delta between models.
  The way to compare a new model against the old one is to look at the diff (delta) between their outputs.
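One simple way to quantify such a delta (my own sketch, not a metric the rules prescribe) is the symmetric-difference rate between the two models' top-N result sets for the same query:

```python
def ranking_delta(old_topn: list[str], new_topn: list[str]) -> float:
    """Symmetric-difference rate between two top-N result lists:
    0.0 means the same set of results, 1.0 means completely disjoint."""
    old_set, new_set = set(old_topn), set(new_topn)
    union = old_set | new_set
    if not union:
        return 0.0
    return len(old_set ^ new_set) / len(union)

delta = ranking_delta(["a", "b", "c", "d"], ["a", "b", "e", "f"])
```

A delta near zero before launch means the new model barely changes what users see; a large delta means the side-by-side results deserve a careful manual look.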
- Look for patterns in the measured errors, and create new features.
  Look at what the samples the model gets wrong have in common, and add new features accordingly.
- Try to quantify observed undesirable behavior.
  When you observe behavior that does not match expectations, investigate and measure it before anything else.
- Re-use code between your training pipeline and your serving pipeline whenever possible.
  Keep training and serving code as consistent as possible; this makes debugging much easier.
- If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
  Training and validation must strictly rule out temporal feature leakage (no future information bleeding into the training window).
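The rule above amounts to splitting strictly by time rather than randomly. A minimal sketch (the row schema is hypothetical):

```python
from datetime import date

def temporal_split(rows: list[dict], cutoff: date):
    """Split rows by time: everything up to and including the cutoff date
    trains; everything after it tests. No future row can leak into training."""
    train = [r for r in rows if r["date"] <= cutoff]
    test = [r for r in rows if r["date"] > cutoff]
    return train, test

rows = [
    {"date": date(2024, 1, 3), "y": 0},
    {"date": date(2024, 1, 5), "y": 1},
    {"date": date(2024, 1, 6), "y": 1},
    {"date": date(2024, 1, 7), "y": 0},
]
train, test = temporal_split(rows, date(2024, 1, 5))
```

A random shuffle would mix the 6th and 7th into training, producing an evaluation that is optimistic in exactly the way the rule warns about.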
- Measure Training/Serving Skew.
  Quantify the gap between offline and online: 1. there may be a diff between the training data and the held-out data; 2. there may be a diff between the held-out data and next-day data, since some features are time-sensitive; 3. if next-day data still differs from live traffic, there is most likely a bug in the code.
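One way to put a number on such a gap for a single feature (my own illustrative sketch, not the document's prescribed metric) is a standardized difference of means between the training log and the serving log:

```python
import math

def mean_and_std(xs: list[float]):
    """Population mean and standard deviation of a list of values."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, math.sqrt(var)

def feature_skew(train_values: list[float], serve_values: list[float]) -> float:
    """Standardized difference of a feature's mean between training and
    serving logs. Values far from 0 flag a skewed feature worth auditing."""
    mt, st = mean_and_std(train_values)
    ms, ss = mean_and_std(serve_values)
    pooled = math.sqrt((st ** 2 + ss ** 2) / 2) or 1.0  # guard constant features
    return (ms - mt) / pooled

skew = feature_skew([1.0, 2.0, 3.0], [1.1, 2.1, 3.1])
```

Running this per feature across the three data pairs above (train vs. hold-out, hold-out vs. next-day, next-day vs. live) localizes whether the gap comes from time-sensitivity or from a pipeline bug.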
ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models
- Don’t waste time on new features if unaligned objectives have become the issue.
  If the model's objective no longer matches the product's objective, change one of them; don't sink time into new features instead.
- Launch decisions are a proxy for long-term product goals.
  Launching a model involves many metrics at once. An imperfect model will usually push some metrics up (e.g. CTR) and others down (e.g. DAU), so leadership needs to state up front what the expected win of this launch actually is: more page views, or a higher click-through rate.
- Keep ensembles simple.
  Each base model in an ensemble must be simple, and gains or losses in a base model's metrics must correlate positively with the ensemble's; only then will the ensemble improve.
- When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
  When you hit a plateau, don't spend too much more time on existing features; you must add new ones.
- Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
  Keep the model's objective as simple as possible. One model cannot satisfy diversity, personalization, and relevance all at once; some of those needs can be handled with post-hoc rules instead.