Machine Learning Stuff Schools Do Not Teach

If you are a Machine Learning/Data Science enthusiast who wants to enter this field, chances are you have taken a deep learning course on Coursera or Fast.ai, or have gone to Kaggle to practice and polish your skills. Those are great learning materials that will equip you with solid knowledge and good training experience.

However, the training ground is still a long way from the battlefield. Schools, courses, and competitions focus mostly on machine learning algorithms, which play only a small part in a real-life machine learning project. There are other things that courses and competitions will not teach you, and that you can only learn once you set foot in the real world.

One year of working as a Machine Learning Engineer has greatly changed my mindset and my practices regarding how a machine learning project should be executed. In this post, I will share some of the lessons I learned in that first year.

Problem statements

Training: let’s solve problems. Real life: what problem?

When taking courses or taking part in competitions, I was usually given machine learning problems crafted by machine learning experts. Naturally, they came with very clear instructions: the objectives, the dataset, the context, explanations, and so on. My job was just to play with the data and produce the results, no questions asked.

In real life, what comes to me are business problems, requested by the business team and/or the product team. I should therefore expect the problem statements to be confusing and ambiguous, with no instructions provided. Even when things seem clear, I can’t be sure that my interpretation of the problem is the same as the business team’s. So my first task is not to solve the problem, but to ask questions.

Take an example: I work at an e-commerce firm. One day, a Product Manager gave me a shopping item and asked me to find the most similar items in the marketplace. Before thinking about collaborative filtering or other fancy algorithms, I needed to ask some very fundamental questions, like:

  • How is “similarity” between item X and item Y defined? Same brand? Same product type? Or users who buy product X always buy product Y?
  • What is the business purpose? To recommend similar items? To build collections? Or to detect duplicated items?
  • Am I building an online service or an offline database?
  • What are the timeline and roadmap for the task?
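
To see why the first question matters, here is a minimal sketch in Python. The catalog, the order history, and the three function names are all made up for illustration; none of them come from a real system. It only shows that the same item can get different "most similar" answers depending on which definition the business team has in mind.

```python
from collections import Counter

# Hypothetical toy catalog and order history, invented for illustration only.
catalog = {
    "A": {"brand": "Acme", "type": "phone case"},
    "B": {"brand": "Acme", "type": "charger"},
    "C": {"brand": "Zenit", "type": "phone case"},
}
orders = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]


def same_brand(item, catalog):
    """Similarity defined as 'same brand'."""
    brand = catalog[item]["brand"]
    return sorted(k for k, v in catalog.items()
                  if k != item and v["brand"] == brand)


def same_type(item, catalog):
    """Similarity defined as 'same product type'."""
    product_type = catalog[item]["type"]
    return sorted(k for k, v in catalog.items()
                  if k != item and v["type"] == product_type)


def co_purchased(item, orders):
    """Similarity defined as 'bought together', ranked by how often."""
    counts = Counter()
    for basket in orders:
        if item in basket:
            counts.update(basket - {item})
    return [other for other, _ in counts.most_common()]


print(same_brand("A", catalog))   # ['B']
print(same_type("A", catalog))    # ['C']
print(co_purchased("A", orders))  # ['B', 'C']
```

Three definitions, three different answers for item A, which is why the question has to be settled with the Product Manager before any algorithm is chosen.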

I expect that as I gain more experience, I will need to start taking the initiative, i.e. identify business needs, formulate the problems, ask myself those questions, and find my own answers.

Datasets

Training: let’s analyze the data. Real life: what data?

In training or competitions, it’s a fair game where everyone has equal access to the same dataset. The quality of the result is mostly decided by the algorithm.

Real life is not a fair game. For many businesses, data is their greatest asset, and it is the data, not the model, that decides the success of the project. The more I work, the more I find myself asking the kind of questions that were never a concern in school:

  • What are the available data sources? How do I access them?
  • Is the data labeled? If not, how do I label it? If yes, is the label quality good?
  • Is there sufficient data for my algorithm? If not, how do I get more? If yes, how do I process that much data?

The availability, quality, and quantity of the data have a decisive impact on every other step of the project. If the data is bad, then every machine learning model looks like a bad choice. If the data is good, then even if-else rules could work. I often heard people say that Data Scientists spend 80% of their time finding and processing data; now I know that they were telling the truth.
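
To make the "even if-else rules could work" point concrete, here is a small sketch in Python. The messages, the labels, and the word list are all invented for this example, so the score says nothing about real data. It only illustrates how little code a rule-based baseline needs when the labels are clean.

```python
# Hypothetical, hand-made examples: (message, is_spam). Invented for illustration.
labeled = [
    ("win a free prize now", True),
    ("free coupon, click here", True),
    ("meeting moved to 3pm", False),
    ("lunch tomorrow?", False),
]

# Assumed keyword list; a real one would come from domain knowledge.
SPAM_WORDS = {"free", "prize", "click"}


def rule_based_is_spam(message):
    """Flag a message when it contains any word from the fixed list."""
    cleaned = message.lower().replace(",", " ").replace("?", " ")
    return bool(set(cleaned.split()) & SPAM_WORDS)


def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)


print(accuracy(rule_based_is_spam, labeled))  # 1.0 on this toy set
```

A baseline like this also gives me a number that any later model has to beat.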

When I first started working, I often asked my colleagues which algorithms and libraries they used. These days, the very first question I ask is: “How did you get your training data?”

Algorithms

Training: let’s build models. Real life: let’s build pipelines.

In training, the data is small enough to fit on a single machine, and the project scope is narrow enough to fit into a single notebook. Thus, a few lines of code with the help of Pandas, the PyTorch DataLoader, and the like can get my data ready for the model.

In real life, things are much more complicated. The data size and project complexity require me to handle data loading, data processing, model evaluation, etc., each as a separate component. Sometimes I also need to worry about setting up machines, scheduling, data versioning, code versioning, and so on. As illustrated in this paper by Google, machine learning is just a very small component of the project, so it’s best not to focus too much on it at the start.

[Figure from the paper; image not available.]

Over time, I have settled on a practice: when starting a project, the top priority is to quickly set up a full end-to-end pipeline and test it with a small dataset. At this point, I only need things to run; performance is not yet my concern. This applies not only to the machine learning components but also to the other parts of the pipeline. Once things are up and running, the areas that need improvement can be identified and worked on, one at a time. This helps me pinpoint the bottleneck easily and plan the project better.
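
A skeleton of that practice might look like the sketch below. Everything in it is a stand-in I made up for illustration: the loader returns synthetic rows, the "model" is a single threshold, and the stage names are hypothetical. The point is only that every stage exists and runs end to end on a small sample before any stage gets tuned.

```python
import random


def load_data(n_rows, seed=0):
    """Stand-in loader: returns synthetic (feature, label) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = rng.random()
        rows.append((x, x > 0.5))
    return rows


def preprocess(rows):
    """Placeholder processing step: round the feature to 3 decimals."""
    return [(round(x, 3), y) for x, y in rows]


def train(rows):
    """Simplest possible 'model': a threshold at the mean feature value."""
    threshold = sum(x for x, _ in rows) / len(rows)
    return {"threshold": threshold}


def evaluate(model, rows):
    """Accuracy of the threshold model on held-out rows."""
    correct = sum((x > model["threshold"]) == y for x, y in rows)
    return correct / len(rows)


def run_pipeline(n_rows=100):
    """Run every stage once on a small sample, end to end."""
    rows = preprocess(load_data(n_rows))
    split = int(0.8 * len(rows))
    model = train(rows[:split])
    return evaluate(model, rows[split:])


if __name__ == "__main__":
    print(f"held-out accuracy: {run_pipeline():.2f}")
```

Once this runs, each function can be swapped for a real component one at a time, and the end-to-end score shows which swap actually helped.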

Training: let’s use advanced models, like the pros. Real life: let me be a noob.

I used to think that using advanced models, or better yet building them from scratch, was a sign of expertise. So in training, to polish my portfolio or to gain that extra 0.01 on the leaderboard, I tended to go for more complex, fancier models. I had the time, and the dataset was small anyway, so why not go for it?

Doing the same in real life, however, can be a sign of stupidity. A complex model is much less practical, for many reasons:

  • It costs the company more money to train.
  • It takes more time to set up, and even more for each training iteration. This reduces the number of feedback loops in which I can identify issues and make improvements.
  • Its results are more difficult to explain. Imagine that I used XLNet for spam e-mail detection, the results turned out bad, and I had no idea what went wrong among those 340M parameters.
  • My bottleneck may be something else, not the model. Imagine that I spent 2 weeks building multi-layer ensemble models, only to realize that the bad performance came from wrongly labeled ground truth data.

As such, my current routine for selecting machine learning models is as follows:

[Figure: My routine for selecting the algorithm.]

Some other lessons that I learned about algorithms:

  • Domain knowledge and good data can beat any model.
  • The goal is not to build models, but to solve problems and produce reasonable results within reasonable time and resources.

Evaluation

Training: prediction completed, the job is done. Real life: not so fast!

In training, the evaluation metrics are clearly defined together with the problem. There’s also a leaderboard for comparing the performance of my models with others’. Thus, the moment my model spits out its output, I can submit the results right away and voilà, the job’s done.

In real life, things don’t stop when the prediction is completed.

  • More often than not, I need to define the evaluation metrics myself, which is not always trivial.
  • Input data is usually noisy, so results from a train-test split can be unreliable. Thus, a more convincing evaluation method is required, such as manual checking or A/B testing.
  • Good model performance is one thing; getting it approved by the manager is another. Proper communication and good storytelling, together with concrete supporting data, are always needed.
  • The output data/model/service needs to be properly passed on to the next process in the pipeline, together with proper documentation.
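
As one way to picture the manual-checking step, the sketch below computes precision and recall of model predictions against a small hand-checked sample. The two lists are invented for illustration, and precision/recall is only one possible choice of metric; which metric fits depends on the business purpose.

```python
def precision_recall(predictions, manual_labels):
    """Precision and recall of boolean predictions against hand-checked labels."""
    pairs = list(zip(predictions, manual_labels))
    tp = sum(1 for p, m in pairs if p and m)        # predicted yes, truly yes
    fp = sum(1 for p, m in pairs if p and not m)    # predicted yes, truly no
    fn = sum(1 for p, m in pairs if not p and m)    # predicted no, truly yes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical sample: model output vs. what a human reviewer decided.
predictions = [True, True, True, False, False, True]
manual_labels = [True, False, True, True, False, True]

print(precision_recall(predictions, manual_labels))  # (0.75, 0.75)
```

Numbers from a hand-checked sample like this are also the kind of concrete supporting data that helps when presenting results to a manager.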

Conclusion

Many courses focus only on machine learning algorithms, and most competitions are only about building machine learning models. Yet the algorithms and models are just very small parts of a real-life project. A well-executed project also needs a proper problem statement, good data sources, a solid engineering structure, a smooth data pipeline, explainable results, and reliable, convincing evaluation metrics.

These are just some of the many valuable lessons that I learned in my first year working as a Machine Learning Engineer. Check out some of my previous posts here and here, and stay tuned for more interesting stories to come.

Thank you for reading.

Translated from: https://towardsdatascience.com/machine-learning-stuff-schools-do-not-teach-f2869f964b78
