Data Science First Steps

With the popularity of and demand for data scientists, and the well-documented shortage of skilled labor, more people are interested in data science as a career. Over time, I’ve received a growing number of questions about how to start out as a data scientist. As with many other roles, landing the first job is typically the hardest, since many employers require some prior experience. This can create a vicious catch-22: how do you land your first job if they all require prior experience?

In this post, I’ll try to give you some advice — based on my own experience moving into data science several years back, and my current experience managing a data science department, interviewing dozens of candidates and reviewing hundreds of applications every year.


What’s your background?

From my experience, people trying to start a career in data science can be split into three relatively distinct groups. It’s important to identify which of these you are most similar to, in order to figure out your best next steps.

  1. The STEM career change — These are people with an advanced academic degree in a technical/scientific field who may already have several years’ work experience in an adjacent field. As the hype around data science has grown, they’ve started considering the option of transitioning. They typically have a strong mathematics and research background and can follow the linear algebra and statistics behind machine learning models. They have experience reading academic papers and aren’t intimidated by the formulas. Their transferable skills can help them become good data scientists relatively quickly.

  2. The data science new grad — While it’s taken a few years, universities have started to address the industry demand and various faculties are now offering MSc programs in data science. Depending on the university, these might include the statistics, electrical engineering or industrial engineering departments. While these degrees can’t cover everything, they’re quickly becoming a gold standard for comprehensive data science training that a 3- or 6-month bootcamp can’t meet. A good program will also include a thesis (and publication/s), which gives the employer an opportunity to discuss your work in greater detail. Whenever interviewing new grads I deep dive into their thesis, making sure they understand alternative approaches, discuss why they made certain decisions and ascertain how they handle feedback. Due to the scope of a thesis, it’s usually a great way to evaluate how someone performs research and how well they really know their material, in a way that a Kaggle project they did a while back can’t achieve.

  3. The optimist — This is someone who hasn’t gone through formal data science training and doesn’t have an extensive statistics/math background. They may have several years’ experience in data analytics within a specific vertical (finance, healthcare, etc.) and want to complement their current skills to gradually move into a data science role. In the past, several people have asked me about their chances of becoming a data scientist in fintech or some other specific vertical. While business acumen and experience in the vertical are important, this is the wrong mindset. The commonality between data science roles in various verticals is significant: the tools and algorithms solve generic mathematical problems, not vertical-specific ones. It’s easier to teach a good data scientist about a new domain than it is to teach a business analyst with domain knowledge programming, statistics and machine learning. If you want to be a data scientist, aim to be just that, not a fintech data scientist.

If you’ve read this far, you probably know that there are a lot of online courses teaching everything related to data science. While those courses cover the fundamentals and deliver a ton of content, the vast majority try to give the most practical information as fast as possible. This typically means you’re going to learn a lot of machine learning models but only get the 30,000-foot explanation of how each algorithm actually works. Many courses avoid complex math so they can remain accessible to as big an audience as possible. While it’s definitely possible to train models and ‘do data science’ without understanding the intricacies of the algorithms, your capabilities will be limited. With the trend of automated ML picking up, plugging in an algorithm and trying out a few standard options won’t require a data scientist in the near future. Like many other professionals, data scientists will need to keep an edge over automated systems to keep their jobs, which will typically mean a much deeper understanding of the algorithms.

Because data science training is so accessible and there are no standard qualifications required to practice data science, anyone who has completed a 50-hour course can call themselves a data scientist. As elsewhere, when a role is in high demand, supply increases to meet it, and an influx of new candidates starts moving in. To have a serious chance of making it in the field, a significant investment of time is required.

How to break into data science

There are different ways to gain the minimal experience and knowledge to get your first data science position. When hiring for a junior position, the interviewer is going to look for a few things:

  • Do you understand the fundamentals and theory of machine learning?

  • Do you have the necessary coding skills (usually Python or R)?

  • Can you demonstrate both of these points (i.e. walk the walk, not just talk the talk)?


As a candidate, you need to remember that the company’s loss function is asymmetric — hiring a bad candidate can have a much worse outcome than turning down a good hire. This means that companies are going to be cautious about taking risks on someone lacking a track record. You need to help the hiring manager as much as possible to demonstrate that you’re a low-risk and high-potential hire. This also means that your chances may be relatively low and you need to be emotionally prepared for a lot of rejections before getting an offer.
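
To make the asymmetric loss concrete, here is a minimal Python sketch. It assumes the hiring decision can be reduced to a single estimated probability that the candidate works out, plus two cost values; the 5:1 cost ratio and all function names are hypothetical and chosen only for illustration, not taken from any real hiring process.

```python
def expected_loss_if_hired(p_good: float, cost_bad_hire: float) -> float:
    """Expected cost of hiring: the company only loses if the hire turns out bad."""
    return (1.0 - p_good) * cost_bad_hire


def expected_loss_if_rejected(p_good: float, cost_missed_hire: float) -> float:
    """Expected cost of rejecting: the company only loses if the candidate was good."""
    return p_good * cost_missed_hire


def hiring_threshold(cost_bad_hire: float, cost_missed_hire: float) -> float:
    """Probability above which hiring has the lower expected loss.

    Hiring wins when (1 - p) * cost_bad_hire < p * cost_missed_hire,
    which rearranges to p > cost_bad_hire / (cost_bad_hire + cost_missed_hire).
    """
    return cost_bad_hire / (cost_bad_hire + cost_missed_hire)


def should_hire(p_good: float,
                cost_bad_hire: float = 5.0,      # assumed: a bad hire is 5x as costly
                cost_missed_hire: float = 1.0) -> bool:
    """Pick the action with the smaller expected loss."""
    return (expected_loss_if_hired(p_good, cost_bad_hire)
            < expected_loss_if_rejected(p_good, cost_missed_hire))


if __name__ == "__main__":
    print("symmetric threshold: ", hiring_threshold(1.0, 1.0))
    print("asymmetric threshold:", round(hiring_threshold(5.0, 1.0), 3))
    for p in (0.6, 0.8, 0.9):
        print(f"p_good={p}: hire? {should_hire(p)}")
```

With symmetric costs, a candidate only has to look more likely good than bad. Under the assumed 5:1 ratio the bar rises to 5/6, so a candidate who seems 80% likely to succeed is still turned down. Anything that raises the interviewer's estimate (a thesis, a shipped project, a reference) matters more than it would under a symmetric loss.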

There are three main ways to gain the theoretical knowledge and expertise necessary for your first role, and they can be combined in various ways:


  1. Master’s degree (with thesis) — As mentioned above, this is probably the gold standard for training today. While it can take 1–2 years, it is time well spent, especially if you study at a well-known university. University pedigrees vary by location, so it helps to understand what’s considered a good university in your vicinity.

  2. Bootcamp — These typically run 3–6 months for full-time immersive programs and much longer if they’re part-time. It’s best to pay close attention to the financial incentive the program has with regard to your future career. In some bootcamps it’s very straightforward: you pay for the training. The best bootcamps, on the other hand, will also offer Income Share Agreements. In this scenario, after the bootcamp is complete you pay them a percentage of your salary, and only if it is above a threshold. The agreement is usually in effect for 2–4 years and is capped (e.g. 1.5–2X the upfront tuition cost). In Israel, ITC and Y-Data operate in this fashion and put a bigger focus on helping their students land their first role. Other bootcamps work by keeping you on their payroll for 2 years following the training period, during which you work on a project for their client companies (e.g. Experis Academy in Israel). The bootcamp pays your salary directly and pockets the difference between it and their outsourcing fee, while typically offering the employee an exit clause (which covers their training expenses).

    Generally speaking, these bootcamps cover a wide range of topics, including theoretical machine learning knowledge, coding skills, statistics and at least one capstone project. As you can see, different bootcamps have different levels of incentive to ensure your successful placement following their training. In some cases, it may be worthwhile to invest the time in a bootcamp even if you already know a fair chunk of the material, just to benefit from their assistance in landing the first position.

  3. Online courses — The number and quality of these courses have been transformational, enabling anyone around the world to learn from the top experts. The fact that such high-quality content is now freely accessible to anyone has dramatically lowered the barrier to entry. At a very high level, one can separate these courses into two types: intro-level courses that try to cover a bit of everything in machine learning, and more advanced courses that dive deeper into specific areas. Several of the popular intro-level courses can be completed in under 80 hours of dedicated effort. While this does require dedication (especially for someone doing this on top of a full-time job), it’s a relatively trivial time investment compared to many other high-paying professions (e.g. think of the time required to become a pilot, lawyer or doctor). I’ve seen a few applicants who put down Andrew Ng’s famous Machine Learning course as their only training in the field. I agree that it’s a great course (it was the first one I took when transitioning to data science), but on its own it is definitely not sufficient to qualify someone as a data scientist. You should be very wary of any course that claims to teach you the A-Z of ML. It might be a great intro to the field, but you should treat it as the first step in a long journey.
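
The Income Share Agreement mechanics described in the bootcamp item above (a salary share, an income threshold, a fixed term and a cap) can be sketched in a few lines of Python. This is a simplified illustration, not any specific bootcamp's contract: it assumes a constant monthly salary, and every number and name below (`isa_total_paid`, the 10% share, the 4,000 threshold, the 10,000 tuition) is hypothetical.

```python
def isa_total_paid(monthly_salary: float,
                   share: float,
                   income_threshold: float,
                   term_months: int,
                   upfront_tuition: float,
                   cap_multiple: float) -> float:
    """Total paid to the bootcamp over the ISA term.

    Assumptions (illustrative only): the salary is constant, a payment is
    due only in months where the salary is above the threshold, and total
    payments never exceed cap_multiple * upfront_tuition.
    """
    cap = upfront_tuition * cap_multiple
    paid = 0.0
    for _ in range(term_months):
        if monthly_salary <= income_threshold:
            continue  # below the threshold: nothing is owed this month
        paid = min(cap, paid + monthly_salary * share)
    return paid


if __name__ == "__main__":
    # Hypothetical terms: 10% share, threshold of 4,000 per month,
    # tuition of 10,000 with a 1.5x cap.
    terms = dict(share=0.10, income_threshold=4000.0,
                 upfront_tuition=10000.0, cap_multiple=1.5)
    print(isa_total_paid(6000.0, term_months=24, **terms))
    print(isa_total_paid(6000.0, term_months=36, **terms))
    print(isa_total_paid(3500.0, term_months=36, **terms))
```

Under these made-up terms, a 24-month agreement at a 6,000 salary costs 14,400 in total, a 36-month one hits the 15,000 cap, and a graduate earning below the threshold pays nothing. That last case is why the bootcamp's revenue depends on placing you in a well-paid role.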

What do these trends mean for me?

The STEM career change — Of the three paths this is probably the fastest one, and if you invest enough time, your chances of success are pretty good. Additionally, the closer your background is to data science, the better. Depending on your field, you may already have most of the mathematical background and need to invest more heavily in your programming skills. For an employer, discussing someone’s thesis or dissertation can help show how well they grasp complex research subjects. Can they get into the weeds and back up to 30,000 feet quickly? Do they really understand why they made different decisions or used certain algorithms? What value might their research have? While strong research capabilities aren’t enough for a data scientist, checking these boxes can help de-risk a new candidate, especially one with limited direct experience in the field. As someone who went down this path several years back (my MSc was in applied physics), I continue to see how my education gives me a different viewpoint on solving problems compared to colleagues with math, statistics, economics or biology backgrounds.

Someone going down this path also has the benefit of being able to pick up more advanced material quickly. Once you’ve gotten your feet wet, you’ll want to understand the algorithms in depth and develop an intuition for the hyperparameters. This is a lot easier if you’re accustomed to advanced math.

Pro Tip — if you’re at all able to highlight data science / machine learning work you’ve done before you officially started as a data scientist, you might be able to get additional years of your experience recognized as relevant when negotiating compensation. While you don’t want to embellish your past work, it is useful to point out your programming experience, data analytics, advanced statistics, experimental design, algorithm development or other adjacent types of work.

The data science new grad — assuming you still have some time to complete your studies, look for any extra-curricular activities that can help you gain experience. Ideally, this would involve an internship within a data science team. One of my past employers would regularly bring in interns each summer and make offers at the end of the season to the most promising ones. This was a great win-win and a large portion of the company’s hires came through that program. If an internship isn’t possible, your university might have a capstone project you can invest in. At Riskified we’ve collaborated with a local university, giving one of their teams an open project to work on with our guidance as their capstone. If the students invest and do genuinely good work (i.e. not just to pass their course, but something that would qualify as good work in the company), we could be interested in hiring or at the very least writing a letter of recommendation for future employers.

Pro Tip — When working in data science (as in almost any career), you’ll need to be able to explain things to people outside your domain (side note: never make the mistake of thinking non-technical people aren’t as smart as you). During your interviews, you’re going to be asked quite a bit about your thesis. Find a smart friend with limited knowledge of machine learning to ask you about it. Can you explain to them what you did and how it was different from existing solutions? I’ve interviewed several new grads who could describe all the details of their research but were stumped by some high-level, introductory questions (e.g. why is this research important?).

Finally, don’t forget that success requires lifelong learning and you’ve only completed one phase of your training so far. Continuing to learn on the job is just as important and may be more difficult as it isn’t as structured.

The optimists — There are a lot of people learning to become data scientists through online courses and bootcamps. Competition is stiff and you’re not going to get a job in the field after investing 80 hours. Employers are going to look at the duration of your classes/bootcamp and how familiar they are with the program: nano-degrees on EdX or a 6-month bootcamp are going to be a lot more impressive than a single course on Udemy or Coursera.

In my opinion, the window of opportunity to transition into data science without extensive formal training (e.g. through self-taught online courses) is shrinking. While it’s still doable, you need to realize that there are a lot of people with shallow knowledge of the field (as of September 2020, Andrew Ng’s course has had 3.5M enrolled students), so landing your first job will require a lot more. If you want to go down this path, it will probably still take you several months (read: hundreds of hours) of coursework and hands-on projects, plus a good dose of luck.

Pro Tip — If you can, consider bootcamps that have a proven track record of alumni starting data science positions (if their financial incentive depends on this, even better). While several months of full-time studying might be more than the investment you were considering, it could make all the difference.

The slow but steady autoML trend also means that you need to keep studying and increasing your expertise after you’ve landed your first role. You always need to stay a few years ahead of automation, and a little bit of paranoia can be healthy for long-term job security.

Final thoughts

Compared to other high-income, high-demand professions, data science doesn’t require you to spend several years in medical school or log a thousand flight hours before you’re allowed to practice. While the demand for data scientists is high, most of that demand is for very skilled individuals who can demonstrate their value. You need to keep in mind that despite the lack of regulatory barriers, market forces still exist and companies won’t pay top dollar for someone with limited experience. Moreover, new data scientists require a lot of attention, training and support from more experienced data scientists. As the first few months are almost all investment by the company, it could take a year until a new data scientist’s net contribution gets back to zero. Paradoxically, this problem is exacerbated by the shortage of experienced data scientists: they are badly needed to work on problems now and can only spend a certain amount of time training new people.

It’s not an easy path but it’s definitely rewarding. The world needs more great data scientists, so get to it!


