Untold Truths of Being a Machine Learning Engineer

I recently took part in an interesting Reddit discussion, and a few of my answers were highly upvoted. Its main topic was the untold truths of being a machine learning engineer. Since I was one of the more active participants, I am sharing the key takeaways here in curated form.

1. Using Deep Learning

Many Machine Learning enthusiasts think they will get to play with fancy Deep Learning models and tune Neural Network architectures and hyperparameters. Don’t get me wrong, some do, but not many.

The truth is that ML engineers spend most of their time working on “how to properly extract a training set that resembles the real-world problem distribution”. Once you have that, you can in most cases train a classical Machine Learning model and it will work well enough.
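
To make the idea concrete, here is a minimal Python sketch of drawing a training set whose label proportions match an assumed production distribution. The function name `sample_to_match`, the spam/ham records, and the 10% spam share are all hypothetical and chosen only for illustration. Real projects usually have to match far more than label shares (time ranges, customer segments, and so on).

```python
import random
from collections import Counter, defaultdict


def sample_to_match(records, label_of, target_share, n, seed=0):
    """Draw about n records so that label proportions follow target_share."""
    rng = random.Random(seed)

    # Group the available records by their label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)

    sample = []
    for label, share in target_share.items():
        k = round(n * share)  # how many records of this label we want
        pool = by_label[label]
        if k > len(pool):
            raise ValueError(f"need {k} '{label}' records, have {len(pool)}")
        sample.extend(rng.sample(pool, k))  # sample without replacement

    rng.shuffle(sample)
    return sample


# Hypothetical raw pool: collected 50/50, while production traffic
# is ASSUMED to contain only 10% spam.
raw = [("spam", i) for i in range(500)] + [("ham", i) for i in range(500)]

train = sample_to_match(
    raw,
    label_of=lambda rec: rec[0],
    target_share={"spam": 0.1, "ham": 0.9},
    n=200,
)
counts = Counter(label for label, _ in train)
print(counts["spam"], counts["ham"])  # 20 spam and 180 ham records
```

The point is not the sampling code itself but the decision behind `target_share`: someone has to find out what the real-world distribution looks like before any model is trained.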

Just out of curiosity, which is the hardest problem being solved by any of these algorithms? And which one is being used to solve it?

Deep Learning has had the most success in Computer Vision (e.g., self-driving cars) and, more recently, in Natural Language Processing (GPT-3, etc.). So researchers and practitioners who work in those areas most likely use Deep Learning.

IMO, the all-time greatest achievement is DeepMind’s AlphaGo Zero. Self-driving cars are probably the application that will have the most impact on society. The most recent achievement in Natural Language Processing is GPT-3.

Are Deep Learning models difficult to explain in comparison to classic ML models?

OP said it nicely:

Can’t see how explaining a Convolutional Neural Net would be any harder than explaining a whole classification framework based on SVMs, Random Forests or Gradient Boosting.

I feel like this statement has become less and less true over the years as NNs have seen more research into interpretability.

It clearly still holds when comparing NNs to good old traditional statistics like GLMs or Naive Bayes. But as soon as you move to CART-based methods or anything using the kernel trick, this fabled interpretability goes out the window.
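
As a rough illustration of why a GLM is considered easy to explain, the sketch below uses a hypothetical logistic regression. The intercept, feature names, and coefficients are made up, not fitted to any data. Each coefficient has a direct reading: increasing the feature by one unit multiplies the odds by exp(coefficient), whatever the other feature values are.

```python
import math

# Hypothetical logistic regression parameters (illustrative numbers only).
intercept = -2.0
coef = {"num_links": 0.7, "has_attachment": -0.4}


def predict_proba(features):
    """Probability of the positive class for a dict of feature values."""
    z = intercept + sum(coef[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Same example, with num_links increased by exactly one unit.
base = predict_proba({"num_links": 1, "has_attachment": 0})
plus_one = predict_proba({"num_links": 2, "has_attachment": 0})

# The odds ratio matches exp(coefficient) of the changed feature.
ratio = odds(plus_one) / odds(base)
print(f"observed odds ratio: {ratio:.4f}")
print(f"exp(coef):           {math.exp(coef['num_links']):.4f}")

# The whole model can be summarised in one sentence per feature.
for name, beta in coef.items():
    print(f"{name}: +1 unit multiplies the odds by {math.exp(beta):.2f}")
```

A tree ensemble or a kernel method offers no comparable per-feature sentence, which is the contrast the comment above is drawing.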

2. Learning Machine Learning

When learning, you tend to go through a lot of papers on arxiv-sanity with some really cool algorithms. Then you enter the industry and all you see is relatively basic stuff like logistic regression, feedforward NNs, random forests (decision trees), bag-of-words instead of embeddings, and you feel like these models could be implemented by the average undergrad or even a smart high schooler. Maybe if you’re lucky you’ll see an SVM.

Infrastructure and data pipelines are where all the real engineering work happens.

I felt similar to the OP above at the beginning of my career. But why would you use a more complicated tool to solve the task when there’s no need for it? Many real-world problems don’t require a state-of-the-art NN architecture to be solved. Sometimes a simple logistic regression gets the job done.
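
As a sketch of how little machinery such a baseline needs, here is a bag-of-words logistic regression in plain Python. The four training sentences, their labels, and the learning rate are invented for illustration. In practice you would reach for a library implementation and a real dataset.

```python
import math
from collections import Counter


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit weights and bias with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(text, vocab, w, b):
    """Probability that the text belongs to class 1."""
    x = bag_of_words(text, vocab)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy, made-up data: 1 = finance-related message, 0 = everything else.
docs = [
    ("invoice payment due", 1),
    ("payment overdue invoice attached", 1),
    ("lunch on friday", 0),
    ("team lunch photos", 0),
]
vocab = sorted({word for text, _ in docs for word in text.split()})
X = [bag_of_words(text, vocab) for text, _ in docs]
y = [label for _, label in docs]

w, b = train_logreg(X, y)

# Words outside the vocabulary (such as "reminder") are simply ignored.
p_invoice = predict("invoice payment reminder", vocab, w, b)
p_lunch = predict("lunch photos", vocab, w, b)
print(f"finance-like message: {p_invoice:.2f}")
print(f"lunch-like message:   {p_lunch:.2f}")
```

Everything here fits in a few dozen lines, which is part of why such models are so common in industry: they are cheap to build, quick to debug, and often good enough.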

The second part of the comment is true for smaller startups in which you usually have to take care of data pipelines by yourself. In bigger companies, there are designated departments that deal with infrastructure. But there are no shortcuts — Data Scientists still need to be well informed about how data infrastructure works.

3. Learning Theory

Learn as much fancy theory as you want, but at the end of the day, your job is going to be 99% data cleaning and infrastructure work.

99% is a bit of an exaggeration. To rephrase the OP: Machine Learning Engineers don’t just play with fancy models. Sometimes they need to get their hands dirty by cleaning and labeling the data.

Why don’t you use software and services to label data?

This is very true. So much so that I thought I was alone. I work mostly in NLP and 99% of my job is labelling data and making some infrastructure in Java.

Data labeling services are usually too expensive for the big datasets that are used in practice. Some datasets are also not trivial to label. I once worked on invoice classification, where you would need professional accountants to label the data.

How does Machine Learning look in the real world?

I increasingly notice a gap in the understanding of what Data Scientists actually do. Many aspiring Data Scientists are disappointed when expectations don’t meet reality. Data Science is not just about tweaking the parameters of your favorite model and climbing the Kaggle leaderboard. What if I told you there is no leaderboard in the real world?!

That is why I wrote the ebook Your First Machine Learning Model in the Cloud: to show what working on an actual Data Science project looks like from start to finish. The ebook is aimed at Data Science enthusiasts and Software Engineers who are thinking of pursuing a career in Data Science.

Before you go

I am building an online business focused on Data Science. I tweet about how I’m doing it. Follow me there to join me on my journey.

These are a few links that might interest you:

- Data Science Nanodegree Program
- AI for Healthcare
- Autonomous Systems
- Your First Machine Learning Model in the Cloud
- 5 lesser-known pandas tricks
- How NOT to write pandas code
- Parallels Desktop 50% off

Translated from: https://towardsdatascience.com/untold-truths-of-being-a-machine-learning-engineer-364218db2317
