ML Basics (12): Rules of ML (Google)

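Over the weekend I took the time to read Google's 43 rules of machine learning and got a lot out of them. I have added some notes based on my own understanding, in the hope that they help.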

Before Machine Learning

  1. Don’t be afraid to launch a product without machine learning.

  2. First, design and implement metrics.

  3. Choose machine learning over a complex heuristic.
    If a system's rules have become too complex, switch to a model, because a model is easier to iterate on and maintain.

ML Phase I: Your First Pipeline

  1. Keep the first model simple and get the infrastructure right.
    The baseline model must be simple and easy to analyze; that is what makes iteration possible.

  2. Test the infrastructure independently from the machine learning.
    Everything must be testable: sample processing, model validation, and the correctness of the model's output.

  3. Turn heuristics into features, or handle them externally.
    When moving from rules to a model, any rule that works particularly well can be used directly as a model feature.

  4. Watch for silent failures.
    Pay close attention to data freshness, and make sure the coverage of every feature is tracked by a metric.

  5. Give feature columns owners and documentation.
    Every feature must be traceable, and written up in documentation where necessary.

  6. Choose a simple, observable and attributable metric for your first objective.
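    The training objective must be simple and easy to iterate on, preferably a binary classification metric rather than a regression metric.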

  7. Starting with an interpretable model makes debugging easier.
    A more interpretable model is easier to debug.
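The note on "Watch for silent failures" asks for a coverage metric on every feature. A minimal sketch of what such a check could look like, assuming samples arrive as dictionaries and a missing or `None` value means the feature is absent. The function names `feature_coverage` and `coverage_alerts` and the `max_drop` threshold are hypothetical, not part of Google's rules.

```python
from typing import Dict, Iterable, List, Mapping


def feature_coverage(rows: Iterable[Mapping[str, object]],
                     features: List[str]) -> Dict[str, float]:
    """Return, for each feature, the fraction of rows where it is present."""
    counts = {name: 0 for name in features}
    total = 0
    for row in rows:
        total += 1
        for name in features:
            # A feature counts as present only if the key exists and is not None.
            if row.get(name) is not None:
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in features}
    return {name: counts[name] / total for name in features}


def coverage_alerts(today: Mapping[str, float],
                    baseline: Mapping[str, float],
                    max_drop: float = 0.1) -> List[str]:
    """List features whose coverage fell by more than max_drop versus baseline."""
    return sorted(
        name for name, base in baseline.items()
        if base - today.get(name, 0.0) > max_drop
    )
```

Running `feature_coverage` on each day's samples and comparing against a stored baseline with `coverage_alerts` is one way to surface a stale upstream table before it quietly degrades the model.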

ML Phase II: Feature Engineering

  1. Plan to launch and iterate.
    The model and the pipeline must be built for iteration, so that features are easy to add and remove.

  2. Start with directly observed and reported features as opposed to learned features.
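    The features you start with should not be too complex. Avoid features produced by other models, such as embeddings from a DNN: they are expensive to maintain, and the other model's objective may differ from yours, so what it has learned will not necessarily suit your model.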

  3. Explore with features of content that generalize across contexts.
    Features that carry over across contexts are important.

  4. Use very specific features when you can.
    The more specific and the simpler a feature is, the better; overly abstract features do not work well.

  5. Combine and modify existing features to create new features in human­-understandable ways.
    New features must be created in ways that humans can understand.

  6. The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
    The larger the dataset, the more features you can afford to use.

  7. Clean up features you are no longer using.
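    Delete unused features, along with any crosses built from them.
    If a feature has low coverage and the samples that carry it are evenly distributed, delete it. If a feature has low coverage but the samples that carry it are very unevenly distributed (for example, almost all of them are positive), it is a good feature.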

  8. You are not a typical end user.
    Be sure to find a typical user to try out the system.

  9. Measure the delta between models.
    The way to tell a new model apart from the old one is to look at the diff (delta) between their outputs.

  10. Look for patterns in the measured errors, and create new features.
    Look at what the mispredicted samples have in common, and add new features accordingly.

  11. Try to quantify observed undesirable behavior.
    When you observe behavior that does not match expectations, investigate and measure it first.

  12. Re-use code between your training pipeline and your serving pipeline whenever possible.
    Keep the training and serving code as consistent as possible; this makes debugging easier.

  13. If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
    Training and validation must rule out feature leakage from the future.

  14. Measure Training/Serving Skew.
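    Quantify the gap between offline and online: 1. the training data and the validation (hold-out) data may differ; 2. the hold-out data and the next-day data may differ, because some features are time-sensitive; 3. if the next-day data and the live data still differ, the cause is most likely a bug in the code.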
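The last two rules in this phase (test on data after the training cutoff, and measure training/serving skew) can be sketched together. This is a minimal illustration, assuming each example is a `(date, label, score)` tuple and using accuracy as the metric. The names `split_by_date`, `accuracy`, and `skew_report` are hypothetical helpers, not anything defined in Google's document.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical example layout: (event date, true label, model score).
Example = Tuple[date, int, float]


def split_by_date(examples: List[Example],
                  cutoff: date) -> Tuple[List[Example], List[Example]]:
    """Train on data up to and including the cutoff; test on everything after."""
    train = [e for e in examples if e[0] <= cutoff]
    later = [e for e in examples if e[0] > cutoff]
    return train, later


def accuracy(examples: List[Example], threshold: float = 0.5) -> float:
    """Share of examples whose thresholded score matches the label."""
    if not examples:
        return 0.0
    correct = sum(
        1 for _, label, score in examples
        if (score >= threshold) == bool(label)
    )
    return correct / len(examples)


def skew_report(train: float, holdout: float,
                next_day: float, live: float) -> Dict[str, float]:
    """Break the offline/online gap into the three differences named above."""
    return {
        "train_vs_holdout": train - holdout,
        "holdout_vs_next_day": holdout - next_day,
        "next_day_vs_live": next_day - live,
    }
```

With a cutoff of January 5th, `split_by_date` keeps everything from January 6th onward out of training. In `skew_report`, a large `holdout_vs_next_day` value points at time-sensitive features, while a nonzero `next_day_vs_live` value suggests a discrepancy between the training and serving code.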

ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models

  1. Don’t waste time on new features if unaligned objectives have become the issue.
    If the model's objective turns out to be misaligned with the product's objective, change one of the two rather than wasting time on new features.

  2. Launch decisions are a proxy for long-term product goals.
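    A launch has to weigh many metrics at once. A less-than-perfect model will usually push some metrics up (e.g. CTR) and others down (e.g. DAU), so the team lead needs to state clearly what this launch is expected to gain: more page views, or a higher click-through rate.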

  3. Keep ensembles simple.
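    Every model in the ensemble must be simple, and a rise or fall in a base model's metric must be positively correlated with the ensemble's metric; only then will the ensemble perform better.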

  4. When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
    When you hit a plateau, do not spend too much time on existing features; add new ones instead.

  5. Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
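    Keep the model's objective as simple as possible. A single model cannot satisfy diversity, personalization, and relevance all at once; some of these requirements can be met by adding post-processing rules.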
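The last note suggests handling requirements such as diversity with a rule applied after the model. One possible shape for such a rule is a per-category cap on the ranked list. This is only a sketch: the `(item id, category, score)` layout, the function name `rerank_with_category_cap`, and the choice to demote overflow items rather than drop them are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical item layout: (item id, category, model score).
Item = Tuple[str, str, float]


def rerank_with_category_cap(items: List[Item],
                             max_per_category: int) -> List[Item]:
    """Rank by score, then move items beyond the per-category cap to the end."""
    ranked = sorted(items, key=lambda item: item[2], reverse=True)
    kept: List[Item] = []
    overflow: List[Item] = []
    seen: Dict[str, int] = {}
    for item in ranked:
        category = item[1]
        if seen.get(category, 0) < max_per_category:
            seen[category] = seen.get(category, 0) + 1
            kept.append(item)
        else:
            # Over the cap: keep the item, but behind everything within the cap.
            overflow.append(item)
    return kept + overflow
```

The model still optimizes a single simple objective; the diversity requirement lives entirely in this post-processing step, which can be tuned or removed without retraining.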

References

  1. Rules of Machine Learning