5 Common Skills Data Scientists Should Know

Latest recommended article published on 2024-04-09 10:57:46

weixin_26707803

Tags: python, java, big data

Original article: https://towardsdatascience.com/5-common-skills-data-scientists-should-know-3247b2b12318

Opinion

Table of Contents

Introduction
SQL
Python or R
Jupyter Notebook
Visualizations
Communication
Summary

Introduction

Data Science and Machine Learning can require an overwhelming number of skills. After working for several years at several companies as a Data Scientist, I want to highlight five common skills that Data Scientists should know and that you will most likely use throughout your career: SQL, Python or R, Jupyter Notebook, visualizations, and communication.

You will, of course, encounter more required and beneficial skills as you go, but I hope these five serve as a good starting point, or as an enhancement of wherever you are in your journey as a Data Scientist.

SQL

Whether you are studying to become a Data Scientist or already work as one, you may be surprised to see SQL listed first. It is commonly associated with Data Analysts, while Data Scientists are expected to focus on a programming language and Machine Learning algorithms. However, before you can use Python, R, or any Machine Learning algorithm, you need to gather data. The most commonly used skill I can think of, both from my own experience and from other Data Scientists I know, is SQL. Most companies have a database set up with tables that you can query, and the result of a query may be the dataset you use for your Data Science model. You usually do not need to be a SQL expert to be a successful Data Scientist, but you should become familiar with some key SQL concepts and commands.

The SQL concepts I have come across most often are:

Select
Inner Joins
Left Joins
Where
Group By
Order By
Alias
Case When
Subqueries
Common Table Expressions

Example

Here is a simplified example of a query I would run to obtain a dataset for a Data Science model.

select t1.column_1, t1.column_2, t2.column_3
from table_1 t1
inner join table_2 t2 on t2.shared_id = t1.shared_id
where date > '2020-08-01'
-- every selected, non-aggregated column is listed in the group by
group by t1.column_1, t1.column_2, t2.column_3
order by t1.column_1
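To see several of the concepts listed above working together, here is a minimal sketch that uses Python's built-in sqlite3 module. The tables (customers, orders), their columns, and the spend threshold of 100 are hypothetical stand-ins for a company database, not something from my own work. The query combines a common table expression, a left join, an alias, group by, order by, and case when.

```python
import sqlite3

# Hypothetical in-memory tables standing in for a company database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (customer_id integer primary key, region text);
    create table orders (order_id integer primary key, customer_id integer,
                         order_date text, amount real);
    insert into customers values (1, 'east'), (2, 'west'), (3, 'west');
    insert into orders values
        (10, 1, '2020-07-15', 50.0),
        (11, 1, '2020-08-05', 120.0),
        (12, 2, '2020-08-10', 30.0),
        (13, 2, '2020-09-01', 45.0);
""")

query = """
    with recent_orders as (               -- common table expression
        select customer_id, amount
        from orders
        where order_date > '2020-08-01'   -- ISO-formatted dates compare correctly as text
    )
    select c.customer_id,
           c.region,
           count(r.amount) as n_orders,   -- alias
           coalesce(sum(r.amount), 0) as total_amount,
           case when coalesce(sum(r.amount), 0) >= 100
                then 'high' else 'low' end as spend_tier
    from customers c
    left join recent_orders r on r.customer_id = c.customer_id
    group by c.customer_id, c.region
    order by c.customer_id
"""

# Each row is a tuple: (customer_id, region, n_orders, total_amount, spend_tier).
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The left join keeps customer 3, who has no recent orders, in the result with zero orders; an inner join would drop that row. That difference is often what decides which join you need when building a modeling dataset.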

Python or R

I learned R in school and still use it for some projects, but I mainly use Python, so I will focus on Python rather than R. Another tool a Data Scientist may use is SAS. Python is beneficial because of the large number of libraries and packages that already implement common Data Science and Machine Learning algorithms. You can expect to use ready-made libraries that cover a wide array of models such as Random Forest, Logistic Regression, and Decision Trees. Beyond making your Data Science work more efficient, Python lets you organize your code in an object-oriented format (classes, methods or functions, modules, and so on). This structure keeps modeling code organized and reusable and gives you a scalable framework for deployment. A model written in OOP format is also easier to deploy, whether you do it yourself or hand it off to a Software Engineer, Data Engineer, or Machine Learning Engineer.
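As a rough illustration of what "OOP format" can look like, here is a minimal sketch of a model wrapped in a class with fit and predict methods. The class name GroupMeanModel and its logic (predicting the average target per group) are hypothetical and deliberately trivial; the point is the shape of the code, not the model.

```python
from statistics import mean


class GroupMeanModel:
    """Toy baseline: predicts the mean target seen for each group during fit."""

    def __init__(self):
        self.group_means_ = {}
        self.global_mean_ = None

    def fit(self, groups, targets):
        # Collect the targets observed for each group, then average them.
        by_group = {}
        for group, target in zip(groups, targets):
            by_group.setdefault(group, []).append(target)
        self.group_means_ = {g: mean(ys) for g, ys in by_group.items()}
        self.global_mean_ = mean(targets)
        return self

    def predict(self, groups):
        if self.global_mean_ is None:
            raise RuntimeError("call fit() before predict()")
        # Groups never seen during fit fall back to the overall mean.
        return [self.group_means_.get(g, self.global_mean_) for g in groups]


model = GroupMeanModel().fit(["a", "a", "b"], [1.0, 3.0, 10.0])
predictions = model.predict(["a", "b", "c"])
print(predictions)
```

Because training and prediction sit behind two methods, whoever deploys the model only needs to know how to call fit and predict, not how the notebook that produced it was laid out.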

Some useful libraries in Python include, but are not limited to:

NumPy
SciPy
Pandas
Keras
TensorFlow
scikit-learn
PyTorch

I probably use NumPy and Pandas the most during exploratory data analysis, scikit-learn during model building for Data Science, and Keras and TensorFlow for Machine Learning and Deep Learning work. I have not used PyTorch myself but have heard it is quite popular.
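To make that division of labor concrete, here is a small sketch that uses NumPy and Pandas for a quick look at the data and scikit-learn for model building. The dataset is synthetic and the column names (feature_1, feature_2, label) are made up for illustration; it assumes numpy, pandas, and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset: two numeric features and a binary label derived from them.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_1": rng.normal(size=200),
    "feature_2": rng.normal(size=200),
})
df["label"] = (df["feature_1"] + 0.5 * df["feature_2"] > 0).astype(int)

# Exploratory data analysis: summary statistics and class balance.
print(df.describe())
print(df["label"].value_counts())

# Model building: hold out 25% of the rows, fit a Random Forest, score it.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_1", "feature_2"]], df["label"], test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```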

Jupyter Notebook

Photo by JESHOOTS.COM on Unsplash [2].

When using Python, you can also use the popular Jupyter Notebook tool to organize and explore your dataset and to run your main code. I usually start in a Jupyter Notebook when importing my dataset, performing exploratory data analysis, engineering features, and building models. I think of it as the place for my rough draft before I finalize the code into an OOP format that will ultimately be integrated and deployed. It is also worth commenting in the cells and adding headings and bullet points, so that you can collaborate easily with other Data Scientists and find well-documented cells when you return to the model later.

Here is a useful link for the Jupyter Notebook [3].

Visualizations

Being able to visualize several parts of the Data Science process is incredibly important. You may want to visualize the business problem, the dataset, and the model itself. Perhaps the most common time to visualize is after the model is built: when you explain results to stakeholders, you are describing complex ideas and output that are often better explained visually. Visualizations can also help you and others on your team get a better idea of what the model is doing and how it works.

Some visualization tools you can use for Data Science processes include, but are not limited to:

Tableau
Looker
Power BI
Python Libraries (Matplotlib, Plotly, Seaborn)

I personally tend to use Tableau and Seaborn, but all of these are useful tools for Data Science.
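As one example of a post-model visualization for stakeholders, here is a minimal Matplotlib sketch that plots feature importances as a horizontal bar chart. The importance values and the output file name are hypothetical; in practice the values would come from your trained model.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical importances; in practice these would come from a trained model.
importances = {"feature_1": 0.55, "feature_2": 0.30, "feature_3": 0.15}

# Sort ascending so the most important feature ends up at the top of the chart.
names = sorted(importances, key=importances.get)
values = [importances[name] for name in names]

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(names, values)
ax.set_xlabel("Relative importance")
ax.set_title("What drives the model's predictions")
fig.tight_layout()
fig.savefig("feature_importance.png", dpi=150)
```

A single chart like this, with a plain-language title, is often easier for a non-technical audience to absorb than a table of coefficients.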

Communication

In addition to visualizing, you can expect to communicate.

Communication is not taught as much in Data Science education or in bootcamps, but it is a vital skill for a Data Scientist.

Before you start coding, you will have to speak with several different stakeholders and subject-matter experts. You may first have to convince them that Data Science is necessary for the specific situation. Once you succeed with that part of the process, you will then communicate the benefits of the model itself, along with key points and updates, and eventually the results and impact of your model.

You can expect to employ communication during these parts of the Data Science process:

Business problem with stakeholders
Proof of concept with subject-matter experts
Updates on the model
Results and impact of the model

Summary

As you can see, there are several key skills that a Data Scientist should know. You could spend hours describing skills, but it is best to jump in and start learning, one day at a time, without becoming overwhelmed. The five common Data Science skills you should know are:

SQL
Python or R
Jupyter Notebook
Visualizations
Communication

Please comment below if you can think of any other common skills that you use, so that other Data Scientists can either anticipate them as they enter the job market or use them to better themselves in their current roles.

Thank you for reading!

Translated from: https://towardsdatascience.com/5-common-skills-data-scientists-should-know-3247b2b12318
