https://github.com/LongxingTan/Machine-learning-interview
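- Algorithm Engineer / Machine Learning / Data Scientist Interview Prep 1: Overview (foreign and overseas companies, spring and fall recruiting)
- Algorithm Engineer / Machine Learning / Data Scientist Interview Prep 2: Leetcode 300
- Algorithm Engineer / Machine Learning / Data Scientist Interview Prep 3: System Design
- Algorithm Engineer / Machine Learning / Data Scientist Interview Prep 4: ML System Design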
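Framework
- Understand and clarify the requirements and the problem
  - System goals and main functions
  - Users and use cases
  - Clear boundaries
- Translate the abstract problem into a machine learning problem
- Define evaluation metrics (offline and online)
- System architecture
  - Non-ML components
  - ML components
    - Data collection and preparation
    - Feature engineering
    - Model development and offline evaluation
    - Inference serving
      - Inference mode: batch, online, hybrid
      - Model compression: quantization, pruning, distillation
- Online testing and deployment
  - A/B testing
- Scaling, monitoring, updating
  - Scaling
    - Software system: distributed servers, load balancing
    - Machine learning: distributed training, distributed data collection
  - Monitoring
    - Logs
    - Metrics
  - Updating: continual training
    - Automatic model updates
    - Human in the loop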
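The A/B testing step in the framework usually ends with a significance check on an online metric. As a minimal sketch, assuming the metric is a conversion-style rate (for example click-through) and that both groups are large enough for a normal approximation, a two-sided two-proportion z-test could look like this. The function name and the example counts are hypothetical and are not taken from the outline above.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a / n_a: conversions and users in the control group (A).
    conv_b / n_b: conversions and users in the treatment group (B).
    Returns (z, p_value). Assumes large samples (normal approximation).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10.0% conversion in control vs 15.0% in treatment.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

In an interview it is usually worth adding that the sample size and the test duration should be fixed before the experiment starts, since checking the p-value repeatedly and stopping early inflates the false-positive rate.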
Scenarios
- YouTube recommendation / DoorDash search box / auto suggestion
- Design a YouTube violent content detection system
Possible focus areas
- Improve engagement on a feed
- Reduce customer churn
- Return items for a search engine query
Follow-up questions
- Solution
- How to scale
  - Scaling a general software system (distributed servers, load balancer, sharding, replication, caching, etc.)
  - Training data / knowledge base (KB) partitioning
  - Distributed ML
    - Data parallelism (for training)
    - Model parallelism (for training and inference)
    - Asynchronous SGD
    - Synchronous SGD
  - Distributed training (DT)
    - Data-parallel DT, RPC-based DT
  - Scaling data collection
    - Machine translation (MT) for 1000 languages
    - NLLB (No Language Left Behind)
- Monitoring, failure tolerance, updating (below)
- AutoML (soft: hyperparameter tuning; hard: neural architecture search (NAS))
- Online/offline inconsistency (the model behaves differently in production than in offline evaluation)
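To make the data parallelism and synchronous SGD items concrete, here is a small single-process simulation. It is a sketch under stated assumptions: a toy one-parameter model y ≈ w * x, equal-size data shards, and a plain Python average standing in for the all-reduce step that a real cluster would run over the network. The names `local_gradient` and `sync_sgd_step` are illustrative only.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_sgd_step(w, shards, lr):
    """One synchronous data-parallel step.

    Every worker computes a gradient on its own shard using the same
    weights, the gradients are averaged (the all-reduce), and a single
    update is applied so that all replicas stay identical.
    """
    grads = [local_gradient(w, shard) for shard in shards]  # parallel in practice
    avg_grad = sum(grads) / len(grads)  # stands in for all-reduce (mean)
    return w - lr * avg_grad


# Toy data generated from y = 3 * x, split evenly across two simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, shards, lr=0.01)
print(f"learned w = {w:.4f}")
```

With equal-size shards, averaging the per-worker gradients gives the same update as one full-batch gradient step, which is why synchronous SGD is easy to reason about. Asynchronous SGD would instead apply each worker's gradient as soon as it arrives, trading that equivalence (gradients may be computed on stale weights) for not having to wait on the slowest worker.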
References
- https://github.com/khangich/machine-learning-interview
- Machine Learning Engineering by Andriy Burkov
- https://github.com/shibuiwilliam/ml-system-in-actions
- https://github.com/mercari/ml-system-design-pattern
- https://github.com/chiphuyen/machine-learning-systems-design
- https://github.com/ibragim-bad/machine-learning-design-primer
- https://www.1point3acres.com/bbs/thread-901192-1-1.html
- Grokking the Machine Learning Interview
- https://about.instagram.com/blog/engineering/designing-a-constrained-exploration-system
- https://www.youtube.com/c/BitTiger