5 Things I Wish I Knew When I Started Learning Data Science

weixin_26750511

Published 2020-10-10 15:56:22

Original article: https://towardsdatascience.com/5-things-i-wish-i-knew-when-i-started-learning-data-science-24d6f9a2d1e0

For two years now, I’ve been studying data science concepts on my own, and through this journey, I’ve gained many insights that I want to share with new data scientists who are starting out.

Feel free to take what you want from this article, but I’m simply sharing my opinion for those who are a little lost and would like some more guidance. With that being said, here are my 5 THINGS I WISH I KNEW when I started learning data science:

1) Try to be a good programmer and a good statistician before being a good data scientist.

If you’ve read my older articles, you’ve probably already heard me say this — a data scientist is really a modern statistician who leverages programming to implement statistical methods.

Understanding the fundamentals will make your life a lot easier and actually save you time in the long run. Almost all machine learning concepts and algorithms are based on statistics and probability, and on top of that, many other data science concepts, like A/B testing, are purely statistical as well.
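
To make the A/B testing point concrete, here is a minimal sketch of a two-proportion z-test written with nothing but the Python standard library. The function name `two_proportion_z_test` and the conversion counts are made up for illustration; they are not from any real experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A converts 200 of 2000 visitors, variant B 250 of 2000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Nothing here is machine learning: the whole test is a pooled proportion, a standard error, and a normal tail probability, which is why a statistics foundation pays off.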

Ultimately, how good you are as a data scientist is limited by how knowledgeable you are in programming and statistics.

Check out my previous article, “How I’d Learn Data Science if I Could Start Over” for more guidance on this point.

TLDR: Have a good programming and statistics foundation before learning anything else. It will save you much more time in the long run.

2) Spend less time on online bootcamps, and more time on personal data science projects.

I know this may be a controversial statement to some of you, so let me preface this by saying a couple of things:

This is based entirely on my own anecdotal evidence and on what I have observed among my peers.
There are obviously some amazing online courses/bootcamps that my generalization does not cover, like deeplearning.ai’s courses.
I also want to say that doing a bootcamp is better than doing nothing, if that is the alternative.

That being said, here are several problems with online bootcamps.

They tend to cover the material at a surface level, and on top of that, they tend to give a false sense of having understood what was learned.
They also tend not to be very good for retaining information. I think you can agree that the more time you spend studying a subject, the more likely you are to retain it. The problem with these bootcamps, especially the ones advertised as “becoming an expert in 5 weeks”, is that they don’t give you enough time to let what you’re learning really sink in.
Lastly, they generally tend not to be challenging enough. A lot of bootcamps and courses simply ask you to follow along and copy their code, which doesn’t require you to think critically or in depth.

Why you should be working on personal data science projects.

Personal data science projects are a great way to learn because you’ll be forced to think critically about the problem and solution all on your own.

Through this, you’ll learn so much more than any bootcamp can teach you. You’ll learn how to ask the right questions, how to Google the right questions, how to approach a data science project in a way that works for you, how to be methodical, etc…

By being more invested in your own project, you’ll also feel more motivated to learn more and invest more time, creating a positive feedback loop.

TLDR: Spend less time doing data science bootcamps and more time working on personal data science projects.

Need some ideas to get started? Check out my article, “14 Data Science Projects to do During Your 14 Day Quarantine”.

3) Focus on a select few tools and be really good at them.

There are so many data science packages and tools out there, and that’s cool because you get to personalize your data science toolkit in a way that works for you.

However, it’s easy to get carried away in wanting to learn as many packages and tools as possible. Don’t make this mistake.

You’ll be much better off being extremely fluent in a few tools than scratching the surface with several tools that you’ve barely spent any time using. (Having a laundry list of skills and tools on your resume should not be the end goal!)

To give an example, there are several great data visualization packages out there: Matplotlib, Seaborn, Plotly, Bokeh, etc… There is no need to spend your time trying to master every single one of these — it’s a waste of your precious and limited time.

As another example, if you want to manipulate data with Pandas, be really good with Pandas. If you’re more of a NumPy person, go for it. Yes, ideally you’d like to be good at both Pandas and NumPy, but my point is that it’s probably a good idea to stick to one and master it, rather than constantly hopping around.

The same thing goes with…

Python vs R
TensorFlow vs PyTorch
PostgreSQL vs MySQL
the list goes on…

TLDR: Establish your data science toolkit and stick to it! Mastering 5 tools is better than barely knowing how to use 20 tools.

4) Understanding the various machine learning algorithms out there only makes up a small percentage of data science.

Personally, what got me into data science was all of the different machine learning models, how they worked, and what applications they were useful in. I probably spent at least six months learning and dabbling with several machine learning models, only to realize that this made up just a fraction of what a data scientist needs to know.

Data modeling is only one part of the entire machine learning life cycle. There’s data collection, data preparation, model evaluation, model deployment, and model tuning that you need to have an understanding of as well. In fact, I would say that the majority of time is spent on data preparation, NOT data modeling (machine learning modeling).
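
As a rough illustration of how little of that life cycle is the model itself, here is a toy sketch in plain Python. The records, the `fit_line` helper, and the 75/25 split are all hypothetical choices made for this example, and the "model" is just a least-squares line.

```python
import random
import statistics

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw = [(1, 52), (2, 55), (None, 60), (4, 65), (5, None),
       (6, 71), (7, 74), (8, 80), (9, 83), (10, 88)]

# 1) Data preparation: drop rows with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 2) Train/test split: shuffle with a fixed seed, hold out 25% for evaluation.
random.seed(0)
random.shuffle(clean)
split = int(0.75 * len(clean))
train, test = clean[:split], clean[split:]


# 3) Modeling: ordinary least squares for a single feature.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(train)

# 4) Evaluation: mean absolute error on the held-out rows.
mae = statistics.mean(abs(y - (slope * x + intercept)) for x, y in test)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, MAE = {mae:.2f}")
```

Even in this tiny example, the modeling step is one short function; the rest is cleaning, splitting, and checking the result. Collection, deployment, and tuning are left out entirely.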

On top of that, there are several other things that you’ll have to learn, like version control (Git), pulling data from APIs, understanding the cloud, and the list goes on.

TLDR: Do not spend all of your time trying to master every machine learning algorithm. It only makes up a small percentage of what a data scientist needs to know.

5) As a Data Scientist, it’s common to feel Imposter Syndrome.

From the very first day when I started learning data science and to this very day, I experience Imposter Syndrome on a regular basis. But I learned that that’s completely normal.

Why is it common and okay for data scientists to feel imposter syndrome?

“Data Science” is such a vague term, as it is an interdisciplinary field that includes statistics, programming, mathematics, business understanding, data engineering, etc. And on top of that, there are so many near-synonyms for “data scientist” (data analyst, data engineer, research scientist, applied scientist). My point is that you’ll never be an expert at EVERYTHING that data science encompasses, and you shouldn’t feel like you have to be.
Like everything else in programming and tech, data science is constantly evolving. 20 years ago, Pandas hadn’t even been created. TensorFlow was only released 5 years ago. There are always going to be new technologies coming out and therefore new things that you’ll have to learn.
This kind of relates to my first point, but because you most likely won’t be an expert at EVERYTHING, there’s always going to be someone who’s better at the things that you spend less time on. And that’s okay too.

TLDR: As a data scientist, you will always feel imposter syndrome, and that’s okay.

Thanks for Reading!

Through reading this, I hope that I was able to give you some insights and useful advice that will help clear some of the misconceptions you have and also make your data science journey a lot smoother than mine!

I’ve received really good feedback for my more opinionated data science articles, which is why I wrote this. Like always, take this with a grain of salt if you disagree with anything that I said. But if you enjoyed it, please let me know what else you’d like me to write about.

I wish you guys the best in your data science journey as always!
